INTRODUCTION TO STATISTICAL THEORY

PART 2


(A text book for Degree and Post-Graduate Students) 


By 


Prof. Sher Muhammad Chaudhry 


B. Sc. (Hons.) M.A. (Gold Medalist) 
F.S.S. (London) 
Formerly, Head of Statistics Department, 
Government College, Lahore 


Dr. Shahid Kamal 
M. Sc (Pb), Ph. D. (U.K) 
Institute of Statistics, 
University of the Punjab, Lahore 


ILMI KITAB KHANA 


Kabir Street, Urdu Bazar, Lahore-54000 (Pakistan) 


Reprint 


Composed by 
TAYYAB SIDDIQUE 


Published by: Markazi [illegible]
Printed at: Al-Hajaz Printing Press, 18-A Darbar Market, Lahore


THE FOLLOWING ABBREVIATIONS HAVE BEEN USED:

P.U.      University of Punjab
P.C.S.    Punjab Civil Service
C.S.S.    Central Superior Services of Pakistan
B.Z.U.    Bahauddin Zakariya University, Multan
I.U.      Islamia University, Bahawalpur


PREFACE TO SIXTH EDITION

The sixth edition of "An Introduction to Statistical Theory", revised
jointly, has the same primary objectives as its earlier editions. However,
it contains a considerable amount of new material and has a new format.

The great progress of the subject in previous decades has made it
necessary to rewrite a number of sections and subsections in most of the
chapters, to add new sections in a few places and to add a new chapter
covering topics such as inferences in regression and correlation methods.
Numerous new examples and exercises drawn from past examination
papers of various universities have been added.

The syllabus for the B.A./B.Sc. classes of various universities includes
some basic concepts of vital statistics. The material covering this part of
the syllabus has been placed in Appendix A.

We would like to express our appreciation to many professors,
students and other readers who used the previous editions of the book.
The changes, additions, deletions and corrections incorporated in the
sixth edition are mainly based on their invaluable comments and
suggestions.

Students preparing for different examinations may omit the
sections/chapters not meant for them.

Thanks are due to Messrs. Ilmi Kitab Khana, the publishers, for
their assistance in the production of this edition.

Suggestions for further improvement of the book will be welcome.

Sher Muhammad Chaudhry
Shahid Kamal

Lahore
October, 1996


September 10, 1968 


` Mohammad Khalid Hayat Khari of 




PREFACE TO FIFTH EDITION

The fifth edition of "An Introduction to Statistical Theory, Part II" is
substantially different from the earlier editions. Due to extensive
changes which have taken place in the content of the subject and in
statistical [illegible], most of the chapters have been completely
rewritten and vastly expanded. A new chapter on Nonparametric Tests
has also been added. The present edition of the book is amply supported
with illustrative examples. A number of exercises of practical interest
has been included in order to make this edition all the more useful.

I would like to thank the teachers, the students and other readers
whose [illegible] and useful suggestions have been of great help in the
improvement of the text.

I wish to record my thanks to my son, Shahid Kamal, M.Sc.
(Statistics), for helping me in the solutions of the numerical problems.
Thanks are due to Messrs Ilmi Kitab Khana, the publishers, and
Ch. Manzoor Ahmad, the printer, for bringing out the book in a
presentable way.

Suggestions for further improvement of the book will be welcome.

Deptt. of Statistics,
Govt. College, Lahore
Sher Muhammad Chaudhry


PREFACE TO THIRD EDITION

In the third edition, some changes have been made. These changes
comprise the addition of the many sections presenting either new
[illegible] or expanded discussions and the reorganisation of the material
[illegible] and 14. To make this edition still more useful and up to
date, [illegible] examples both solved and unsolved, selected mainly from
University Papers, have also been added.

I appreciate the valuable and helpful comments made by Professor
[illegible] the Panjab University. I am also grateful to the teachers,
the students and other readers who have appreciated the earlier edition.

Thanks are once again due to [illegible], the printer, for bringing out
the [illegible] presentation.

Suggestions for the improvement of this edition will be welcome.

Department of Statistics,
Govt. College, Lahore, March 1973
Sher Muhammad Chaudhry


PREFACE TO FIRST EDITION

I am herewith presenting the second part of my book "Introduction to
Statistical Theory". It deals with some very important topics such as
Sampling, Testing of Hypotheses, Sampling Distributions, Analysis of
Variance and Covariance, Experimental Designs, etc. An attempt has been
made to define and derive the distributions of χ², Student's t [illegible],
discussing their properties and fields of applicability. The statistical
models in the Analyses of Variance and Covariance have also been discussed
so that the students may have a more comprehensive conception of these
techniques. I have also tried to exemplify the pertinent theory at every
stage.

As stated in Part I, the book is designed to cater for the needs of
B.A./B.Sc. (Pass and Hons.) students in particular and Postgraduate
students in general. This is why the book has been made more flexible so
that the students from other disciplines such as Economics, Commerce,
Agriculture, Psychology, etc., may use it as a text by making a selection
of the topics to be covered. Those students who are interested in
application only may omit sections of mathematical nature without any
difficulty and loss of continuity.

A great care has been exercised in selecting and grading the
examples and exercises. In addition to numerous solved examples, the
book contains more than 275 exercises of different degrees of difficulty.
Most of the exercises have been taken from the University Examination
Papers with a view to enhancing its utility.

I would like to express my gratitude to those who have influenced
my thoughts and from whose books, papers and lectures I have derived
help. I am also indebted to the Literary Executor of the late Sir Ronald
A. Fisher, F.R.S., to Dr. Frank Yates, F.R.S., and to Oliver and Boyd
Ltd., Edinburgh, for permission to reprint tables 15.1, 16.1, 17.1, 17.2 and
Tables I and II from their book, Statistical Tables for Biological,
Agricultural and Medical Research.

Thanks are also due to Haji Sardar Mohammad, the proprietor of
Ilmi Kitab Khana, and Haji Manzur Ahmad, the printer, who patiently
and diligently brought out the book.

In the end, I may mention here that despite careful reading, some
misprints and inaccuracies might have crept in. I shall be grateful to
readers who will draw my attention to misprints and other imperfections
in the presentation and give suggestions for improvements, which will be
thankfully acknowledged.

Sher Muhammad Chaudhry
Deptt. of Statistics,
Govt. College, Lahore.
May 2, 1969




CONTENTS

Preface

14 SURVEY SAMPLING AND SAMPLING DISTRIBUTIONS
   14.1. Introduction
      14.1.1. Statistical Populations
      14.1.2. Advantages of Sampling
      14.1.3. Sample Design and Sample Survey
      14.1.4. Sampling Frame
      14.1.5. Probability and Non-probability Sampling
      14.1.6. Sampling With and Without Replacement
      14.1.7. Sampling and Non-sampling Errors
      14.1.8. Sampling Bias
      14.1.9. Random Number Table
   14.2. Probability or Random Samples
      14.2.1. Simple Random Sample
      14.2.2. Stratified Random Sample
      14.2.3. Systematic Random Sample
      14.2.4. Cluster Sample
      14.2.5. Multistage Sample
      14.2.6. Multiphase Sample
      14.2.7. Sequential Sampling
   14.3. Non-Probability Samples
      14.3.1. Purposive Sample
      14.3.2. Quota Sample
   14.4. Sampling Distributions
      14.4.1. Sampling Distribution of the Mean
      14.4.2. Central Limit Theorem
      14.4.3. Sampling Distribution of Differences between Means
      14.4.4. Sampling Distribution of a Sample Proportion
      14.4.5. Sampling Distribution of Differences between Proportions
      14.4.6. Sampling Distribution of Variances
   Exercises
15 STATISTICAL INFERENCE: ESTIMATION
   15.1. Introduction
   15.2. Estimates and Estimators
   15.3. Point Estimation
      15.3.1. Criteria for Good Point Estimators
      15.3.2. Pooled Estimators from Two or More Samples
   15.4. Methods of Point Estimation
      15.4.1. The Method of Maximum Likelihood
      15.4.2. The Method of Moments
      15.4.3. The Method of Least-Squares
   15.5. Estimation by Confidence Interval
      15.5.1. Confidence Interval Estimate of a Population Mean
      15.5.2. Interpretation of a Confidence Interval
      15.5.3. Confidence Interval for Difference of Means
      15.5.4. Confidence Interval for Population Proportion
      15.5.5. Confidence Interval for the Differences between Proportions
      15.5.6. One-sided Confidence Interval
      15.5.7. Sample size for Estimating Population Mean
      15.5.8. Sample size for Estimating Population Proportion
   Exercises


16 STATISTICAL INFERENCE: HYPOTHESIS TESTING
   16.1. Introduction
      16.1.1. Null and Alternative Hypotheses
      16.1.2. Simple and Composite Hypotheses
      16.1.3. Test-statistic
      16.1.4. Acceptance and Rejection Regions
      16.1.5. Type I and Type II Errors
      16.1.6. The Power of a Test
      16.1.7. The Significance Level
      16.1.8. Test of Significance
      16.1.9. One-tailed and Two-tailed Tests
      16.1.10. Sample size when α and β are specified
      16.1.11. Formulation of Hypotheses
      16.1.12. General Procedure for Testing Hypotheses
   16.2. Tests based on Normal Distribution
      16.2.1. Testing Hypothesis about Mean of a Normal Population
              when σ is known
              Testing Hypotheses about Mean of a Non-Normal Population
              when Sample Size is large
              Testing Hypotheses about Difference between Means
      16.2.5. Testing Hypotheses about a Population Proportion
              when sample size is large
      16.2.6. Testing Hypothesis about Difference between Two Proportions
      16.2.7. Testing Hypotheses about Standard Deviation: Large Samples
      16.2.8. Relationship between Confidence Interval and
              Tests of Hypotheses
   16.3. Tests based on Binomial Distribution
   Exercises

17 THE CHI-SQUARE DISTRIBUTION AND STATISTICAL INFERENCE
   17.1. Introduction
   17.2. The Chi-Square Distribution
      17.2.1. Properties of the Chi-Square Distribution
      17.2.2. The χ²-table
   17.3. Confidence Interval Estimate of Variance of a Normal Population
      17.3.1. Confidence Interval Estimate of σ² from a Sample Variance
      17.3.2. Confidence Interval of σ² from Several Sample Variances
   17.4. Tests based on Chi-Square Distribution
      17.4.1. Testing Hypothesis about Variance of a Normal Population
      17.4.2. Testing Hypothesis about the Equality of Variances of
              k (k>2) Normal Populations
   17.5. Karl Pearson's Approximation
      17.5.1. Testing Hypothesis about p's of the Multinomial Distribution
      17.5.2. Pearson's Chi-square Test of Goodness-of-Fit
      17.5.3. Testing Hypothesis about Independence of two Variables
   17.6. An Aside -- Attributes
      17.6.1. Consistence
      17.6.2. Independence
      17.6.3. Association of Attributes
      17.6.4. Measures of Association
      17.6.5. Contingency Tables
   17.7. Testing Hypothesis of Independence in Contingency Tables
      17.7.1. Coefficient of Contingency for a contingency table
      17.7.2. Yates' correction for Continuity
      17.7.3. An Exact Test for a 2x2 Contingency Table
   17.8. Testing Hypothesis about Equality of Several Proportions
   17.9. The Chi-Square Test as a Test of Homogeneity
   Exercises

18 THE STUDENT'S t-DISTRIBUTION AND STATISTICAL INFERENCE
   18.1. Introduction
   18.2. The Student's t-distribution
      18.2.1. Properties of Student's t-distribution
      18.2.2. The t-tables
      18.2.3. Distribution of Difference of Sample Means:
              Small Samples and σ₁ = σ₂
      18.2.4. Assumptions in Using t-distribution
   18.3. Confidence Interval Estimates of Mean from Small Sample
   18.4. Small Sample Tests of Means
      18.4.1. Testing Hypothesis about Mean of a Normal Population
              when σ is unknown and n < 30
      18.4.2. Testing Hypotheses about Difference of Means of Two Normal
              Populations when σ₁ = σ₂ but unknown
      18.4.3. Testing Hypotheses about Difference of Means of Two Normal
              Populations when σ₁ ≠ σ₂ and unknown
      18.4.4. Testing Hypotheses about Two Means with Paired Observations
   Exercises




19.2.2. The F-Tables of Areas 


19.2.3. Assumptions in Using F-Distribution 
19.3. Confidence Interval for the Variance Ratio 
19.4. Tests based on F-Distribution 

19.4.1. Testing Hypothesis about the Equality of 

Two Variances 
Exercises 


20 THE ANALYSIS OF VARIANCE
   20.1. Introduction
   20.2. One-Way Analysis of Variance
      20.2.1. Partitioning the Sum of Squares
      20.2.2. Partitioning the Degrees of Freedom
      20.2.3. The Analysis of Variance Table
      20.2.4. Alternative Computing Formulas
      20.2.5. One-Way Analysis of Variance: Unequal Sample Sizes
      20.2.6. Assumptions of One-Way Analysis of Variance
   20.3. Two-Way Analysis of Variance
      20.3.1. Two-Way Analysis of Variance without Interaction
      20.3.2. Two-Way Analysis of Variance with Interaction
   20.4. Multiple Comparisons Tests
      20.4.1. The Least Significant Difference Test
      20.4.2. The Student-Newman-Keuls Multiple Range Test
      20.4.3. Duncan's Multiple Range Test
      20.4.4. Contrasts - Scheffe's Method
   20.5. The Analysis of Variance Models
      20.5.1. Least-Squares Estimates of Effects in One-Way ANOVA
      20.5.2. Least-Squares Estimates of Effects in a Two-Way ANOVA
   Exercises
21 STATISTICAL INFERENCE IN REGRESSION AND CORRELATION
   21.1. Introduction
   21.2. Interval Estimation in the Simple Linear Regression
      21.2.1. Confidence Interval Estimate of Population
              Regression Coefficient
      21.2.2. Confidence Interval Estimate of α, the Intercept of
              Regression Line
      21.2.3. Confidence Interval Estimate of Mean Value μ(Y|x) for a
              Given value of x
      21.2.4. Prediction Interval of an Individual Y value for
              a Given value of x
   21.3. Hypothesis Testing in the Regression Model
      21.3.1. Testing Hypothesis about β, the Population
              Regression Coefficient
      21.3.2. Testing Hypothesis about α, the Intercept of
              Population Regression
      21.3.3. Testing Hypothesis about Mean Value μ(Y|x)
      21.3.4. Testing Hypothesis about Population Variance
      21.3.5. Testing Hypothesis about Equality of Regression
              Coefficients of Two Regression Lines
      21.3.6. Testing Hypothesis about the Linearity of Regression
   21.4. Confidence Interval Estimate for Population Correlation Coefficient
   21.5. Hypothesis Testing about Correlation Coefficient
              Testing the Hypothesis that ρ = ρ₀ (≠ 0)
              Testing Hypothesis about Equality of Two Correlations
              Testing Hypothesis about ρ = 0
              Testing Hypothesis about the Equality of Several Correlations
   21.6. Inference in Partial, Multiple Correlation and Regression
      21.6.1. [illegible]
      21.6.2. [illegible]
   21.7. Analysis of Variance for Regression
      21.7.1. ANOVA for Simple Linear Regression [illegible]
      21.7.2. [illegible]


   22.1. Introduction
   22.2. One-Way Analysis of Covariance and Partitioning
         the Sum of Products
      22.2.1. Alternative Computing Formulas for Sum of Products
   22.3. Two-Way Analysis of Covariance
   22.4. Analysis of Covariance Models. One-Way Classification
      22.4.1. Assumptions Made in Analysis of Covariance
      22.4.2. Uses of Covariance Analysis
   Exercises

23 EXPERIMENTAL DESIGNS
   23.1. Introduction
   23.2. Basic Principles of Experimental Designs
      23.2.1. Randomization
      23.2.2. Replication
      23.2.3. Local Control
   23.3. The Completely Randomized Design
      23.3.1. Experimental Layout
      23.3.2. Statistical Model and Analysis
      23.3.3. Advantages and Disadvantages
   23.4. The Randomized Complete Block Design
      23.4.1. Experimental Layout
      23.4.2. Statistical Model and Analysis
      23.4.3. Advantages and Disadvantages
      23.4.4. Randomized Complete Block Design with
              Interaction within Blocks
      23.4.5. Missing Observations in RCB Design
      23.4.6. Estimation of Missing Observations by Covariance
      23.4.7. Efficiency of a RCB Design Relative to a CR Design
   23.5. The Latin Square Design
      23.5.1. Construction and Layout
      23.5.2. Statistical Model and Analysis
      23.5.3. Advantages and Disadvantages
      23.5.4. Missing Observations in a Latin Square
      23.5.5. Efficiency of Latin Squares
      23.5.6. Orthogonal Latin Squares and Graeco-Latin Squares
   23.6. Single Degree of Freedom Contrasts


   23.7. Factorial Experiments
      23.7.1. Main Effects and Interaction Effects
      23.7.2. Effects in a 2²-Factorial Experiment
      23.7.3. Effects in a 2³-Factorial Experiment
      23.7.4. Design and Analysis for Factorial Experiments
      23.7.5. [illegible] for Computing Contrasts
      23.7.6. Advantages and Disadvantages
   Exercises

24 NONPARAMETRIC TESTS
   24.1. Introduction
   24.2. The Sign Test
   24.3. The Wilcoxon Signed-Rank Test for the Paired Observations
   24.4. The Wilcoxon Rank Sum Test for Independent Samples
         The Mann-Whitney U Test
         The Median Test (Two or More Samples)
         The Runs Test for Randomness
         The Kolmogorov-Smirnov Tests
              The Kolmogorov-Smirnov One-Sample Test
              The Kolmogorov-Smirnov Two-Sample Test
         The Kruskal-Wallis H Test
   Exercises

Appendix A   Vital Statistics
Appendix B   Statistical Tables
References
Answers to Exercises
Index


Survey Sampling and Sampling Distributions


14.1 INTRODUCTION 


Sampling is a statistical technique which is used in almost
every field in order to collect information, and on the basis of this
information inferences about the characteristics of a population are
made. The values of the population characteristics are summarized by
certain numerical descriptive measures, called parameters. The values of
the population parameters, which are in most situations unknown,
have to be estimated, and to get estimates we resort to sampling.
The observations composing a sample are used to calculate a
corresponding numerical descriptive measure, called a statistic. Thus we
use statistics to estimate parameters. Considerations of time and cost are
other reasons for sampling. Prior to introducing some of the most
commonly used sampling methods, we proceed to some definitions and to
a brief description of the basic concepts involved in sampling.
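
The distinction between a parameter and a statistic may be illustrated by a small computation. The following is only a sketch in Python: the population of 1000 marks is made up for illustration, and the names `population`, `mu` and `x_bar` are our own and do not come from the text.

```python
import random
from statistics import mean

# Hypothetical population: marks of N = 1000 students (made-up values).
rng = random.Random(42)
population = [rng.randint(0, 100) for _ in range(1000)]

# Parameter: a numerical descriptive measure of the whole population.
mu = mean(population)

# Statistic: the corresponding measure computed from a sample of 50 units.
sample = rng.sample(population, 50)
x_bar = mean(sample)

print(f"population mean (parameter) = {mu:.2f}")
print(f"sample mean (statistic)     = {x_bar:.2f}")
```

In practice the population mean would be unknown, and only the sample mean would be available as its estimate.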


14.1.1 Statistical Populations. A statistical population (or
universe) is defined as the aggregate or totality of all individual members
or objects, whether animate or inanimate, concrete or abstract, of some
characteristics of interest. The individual members of the population are
called sampling units or simply units. A sampling unit from which
information is required may be a college student, an animal, a tree, a
household, a block, a town, a small area, a field, a business firm, etc.
A set of n sampling units selected from a given population is called a
sample of size n and the process of selecting a sample is called
sampling. The numerical values assigned to units of the population
constitute a random variable X, and the distribution of X is called the
population distribution.

A population may be either finite or infinite depending upon whether
or not it contains a countable number of units. The population of all
students in [illegible], the population of motor drivers, the population
[illegible] are examples of finite population. The total number of units
in a finite population is called the size of the population and is denoted
by N. The examples of an infinite population are the population of all
points on a line, the population of pressures at various points in the
atmosphere, etc. A population of concrete units such as trees,
households, students, etc. is called an existent population, while a
hypothetical population consists of all conceivable ways in which an
event can occur, e.g. all possible throws of a die. Such a population
does not exist in a concrete manner but is only to be thought of.
Furthermore, a sampled population is that from which a sample is
chosen, whereas a population about which we wish to draw inferences is
called a target population. The following two examples may suffice to
illustrate the difference between a sampled population and a target
population.

Suppose we desire to know the opinions of college students in the
province of the Punjab with regard to the present examination system.
Then our population will consist of the total number of students in all
the colleges in the province. Suppose on account of shortage of resources
or time, we are able to conduct such a survey only on five colleges
scattered throughout the province, say, situated in large urban areas. In
such a case, the target population consists of the students of all the
colleges in the province, while on the other hand, the sampled population
consists of the students of five colleges, from which the sample of
students is drawn. As long as the students of these five colleges are
representative of the students of all the colleges, the results would be
applicable to the target population. Similarly, the sampled population may
consist of patients in [illegible] district hospitals and the target
population may consist of the total number of patients in the province.
It is of some importance to emphasise that the sampled population should
be such that its results can be applied to the target population. In case
these results cannot be applied to the target population, they hold good
for the sampled population only.


A population is discrete when the number of units comprising the
population is countable, otherwise it is continuous.

A complete coverage of the N sampling units will yield numerical
values $x_1, x_2, \ldots, x_N$ for X. The X will refer to some
characteristic associated with each unit of a population. If, for example,
the sampling unit is a student, then X might refer to age, height, weight,
marks obtained, attitude towards present system of examination, and the
like.


The two basic purposes of sampling are (i) to provide sufficient
information about the characteristics of a population without examining
every unit of the population, and (ii) to find the reliability of the
estimates derived from the sample. We find the reliability by computing
the standard error of a statistic and, if possible, its exact sampling
distribution. The definitions of these terms appear in later sections.
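
As a preview of how reliability is quantified, the sketch below estimates the standard error of a sample mean by the usual formula s/√n, where s is the sample standard deviation. It assumes a simple random sample from a large population. The function name `standard_error_of_mean` and the marks are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, stdev

def standard_error_of_mean(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    n = len(sample)
    return stdev(sample) / math.sqrt(n)

marks = [52, 61, 47, 70, 58, 66, 49, 55]   # made-up sample values
print("sample mean    =", mean(marks))
print("standard error =", round(standard_error_of_mean(marks), 3))
```

A smaller standard error indicates that sample means computed from repeated samples would vary less from one sample to another.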


14.1.2 Advantages of Sampling. The important advantages of
sampling over complete enumeration are briefly stated below:

(i)    Sampling saves money as it is much cheaper to collect the
       desired information from a small sample than from the whole
       population.

(ii)   Sampling saves a lot of time and energy as the needed data are
       collected and processed much faster than census information.
       And this is a very important consideration in all types of
       investigations or surveys.

(iii)  Sampling provides information that is almost as accurate as that
       obtained from a complete census; rather, a properly designed and
       carefully executed sample survey may provide more accurate
       results. Moreover, owing to the reduced volume of work, persons
       of higher calibre and properly trained can be employed to analyse
       the data.

(iv)   Sampling makes it possible to obtain more detailed information
       from each unit of the sample as collecting data from a few units
       of the population (i.e. sample) can be more complete and
       thorough.

(v)    Sampling is essential to obtaining the data when the
       measurement process physically damages or destroys the
       sampling unit under investigation. For example, in order to
       measure the average lifetime of light bulbs, the measurement
       process destroys the sampling units, i.e. the bulbs, as they are
       used until they burn out. A manufacturer will therefore use only
       a sample of light bulbs for this purpose and will not burn out all
       the bulbs produced. Similarly, the whole pot of soup cannot be
       tasted to determine if it has an acceptable flavour.

(vi)   Sampling may be the only means available for obtaining the
       needed information when the population appears to be infinite or
       is inaccessible such as the population of mountainous or thickly
       forested areas. In such cases, taking a complete census to collect
       data would neither be physically possible nor practically feasible.

(vii)  Sampling has much smaller "non-response", following up of
       which is much easier. The term non-response means the
       non-availability of information from some sampling units included
       in the sample for any reason such as failure to locate or measure
       some of the units, refusals, not-at-home, etc.

(viii) The most important advantage of sampling is that it provides
       a valid measure of reliability for the sample estimates and this is
       one of the two basic purposes of sampling.


14.1.3 Sample Design and Sample Survey. A sample design is
a definite statistical plan concerned with all principal steps taken in the
selection of a sample and the estimation procedure. These steps are
formulated in advance of conducting the sample. The term survey has
been defined as a means of collecting information to meet a definite
need. When a survey is carried out by a sampling method, it is called a
sample survey. The main steps in a sample survey are to:

(i)   clearly state the objectives of the survey;
(ii)  define the population we wish to study as clearly as possible;
(iii) construct the sampling frame by clearly defining the sampling
      units;
(iv)  choose an appropriate sample design and proper sample size;
(v)   organize a reliable field work to achieve the objectives of the
      survey;
(vi)  summarize and analyse the data.

14.1.4 Sampling Frame. A sampling frame is a complete list or
map that contains all the N sampling units in a population. A complete
list of the names of all the students in the Government College, Lahore
on, say, March [illegible] is the frame. A list of all households in a
city, a map showing all fields, etc. are other examples of the frame. The
requirements of a reasonably good frame are that the frame should

(i)   not contain inaccurate sampling units,
(ii)  be complete and exhaustive, i.e. should contain all the sampling
      units and should cover the whole of the population,
(iii) be free from errors of omission and duplication of sampling
      units, and
(iv)  be as up-to-date as possible at the time of use.

Most of the frames used for sample surveys do not meet all these
requirements.
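
The requirements of a good frame can be checked mechanically when a reference list of the population is available, which in practice is often not the case. The sketch below is illustrative only: the function `check_frame` and the student identifiers are hypothetical and are not part of the text.

```python
from collections import Counter

def check_frame(frame, population_ids):
    """Report duplicated, omitted and extraneous units in a sampling frame."""
    counts = Counter(frame)
    duplicates = sorted(u for u, c in counts.items() if c > 1)
    omissions = sorted(set(population_ids) - set(frame))
    extraneous = sorted(set(frame) - set(population_ids))
    return {"duplicates": duplicates,
            "omissions": omissions,
            "extraneous": extraneous}

# Hypothetical population of five students and a faulty frame.
population_ids = ["S01", "S02", "S03", "S04", "S05"]
frame = ["S01", "S02", "S02", "S04", "S09"]

report = check_frame(frame, population_ids)
print(report)
```

Here the frame lists one unit twice, omits two units and contains one unit that does not belong to the population, so it fails requirements (i) to (iii).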


14.1.5 Probability and Non-probability Sampling. Sampling
methods are broadly classified as Probability Sampling and
Non-probability Sampling. When each unit in a population has a known
non-zero (not necessarily equal) probability of its being included in the
sample, the sampling is said to be probability sampling. A probability
sampling is also called random sampling. The major types of probability
sampling are Simple random sampling, Stratified random sampling,
Systematic sampling, Cluster sampling, etc. The advantage of probability
sampling is that it provides a valid estimate of sampling error.
Probability sampling is widely used in various areas such as industry,
agriculture, business, etc.

A non-probability sampling, also called non-random sampling, is a
process in which the personal judgement determines which units of the
population are selected for a sample. The disadvantage of non-probability
sampling is that the reliability of the sample results cannot be
determined in terms of probability. Non-probability sampling techniques
include Purposive sampling and Quota sampling.
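
The phrase "known non-zero probability of inclusion" in the definition of probability sampling can be made concrete for the simplest case. When a simple random sample of n units is drawn without replacement from N units, each unit is included with probability n/N. The simulation below is a rough check of that value. The sizes N = 10 and n = 3 and the number of trials are arbitrary choices made for illustration.

```python
import random

rng = random.Random(11)
N, n, trials = 10, 3, 20000
population = list(range(N))

# Count how often each unit appears in a simple random sample
# drawn without replacement.
hits = [0] * N
for _ in range(trials):
    for unit in rng.sample(population, n):
        hits[unit] += 1

estimated = [h / trials for h in hits]
print("theoretical inclusion probability:", n / N)
print("estimated:", [round(p, 3) for p in estimated])
```

Under purposive or quota selection no such probability can be attached to a unit, which is why the reliability of the results cannot be expressed in terms of probability.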


14.1.6 Sampling With and Without Replacement. Samples
may be selected with replacement or without replacement. Sampling is
said to be with replacement when from a finite population a sampling
unit is drawn, observed and then returned to the population before
another unit is drawn. The population in this case remains the same and
a sampling unit might be selected more than once. If, on the other hand,
a sampling unit is chosen and not returned to the population after it has
been observed, the sampling is said to be without replacement. Here the
sampling units cannot be selected again for that sample as the units
drawn are not replaced. Although the successive drawings then become
dependent, for practical purposes they are treated as independent
drawings when the population is large relative to the sample. When
sampling is performed with replacement, a finite population can
theoretically be considered as an infinite population and a sample of any
size can be drawn because the population is not exhausted. But when
sampling is done without replacement, the sample size cannot be greater
than the population size.
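
The two modes of selection can be tried out with Python's standard `random` module, in which `choices` draws with replacement and `sample` draws without replacement. This is only a sketch: the five-unit population and the variable names are made up for illustration.

```python
import random

rng = random.Random(7)
population = ["A", "B", "C", "D", "E"]   # a small finite population, N = 5

# With replacement: each unit is returned before the next draw,
# so a unit may repeat and the sample size may exceed N.
with_repl = rng.choices(population, k=8)

# Without replacement: a drawn unit is not returned,
# so all selected units are distinct.
without_repl = rng.sample(population, k=3)

# Without replacement, a sample larger than the population is impossible.
try:
    rng.sample(population, k=8)
    oversized_ok = True
except ValueError:
    oversized_ok = False

print(with_repl)
print(without_repl)
print("sample of 8 without replacement possible:", oversized_ok)
```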


14.1.7 Sampling and Non-sampling Errors. A sample being
only a part of a population cannot perfectly represent the population, no
matter how carefully the sample is selected. This results in a difference
between the value of sample statistic and the true value of the
corresponding population parameter. Such a difference is called
sampling error for that sample. If, for example, $\bar{x}$ is the mean
obtained from a sample of size n and $\mu$ is the corresponding population
parameter, then the difference between $\bar{x}$ and $\mu$ is sampling
error, that is

        sampling error = $\bar{x} - \mu$

As the sample size increases, the sampling error is reduced and in a
complete enumeration (census), there is no sampling error as $\bar{x}$
becomes equal to $\mu$. Sampling error is measured by what is known as
the precision, which is related to the variance of the sample statistic.
The smaller the variance, the greater the reliability of the sample
results will be.
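
The behaviour of the sampling error as n grows may be seen in a small simulation. The sketch below uses a made-up population of 2000 normally distributed values. The helper names `sampling_error` and `mean_abs_error` are our own, and the number of repetitions is an arbitrary choice.

```python
import random
from statistics import mean

rng = random.Random(1)
population = [rng.gauss(50, 10) for _ in range(2000)]   # hypothetical values
mu = mean(population)

def sampling_error(n):
    """x-bar minus mu for one simple random sample of size n
    drawn without replacement."""
    return mean(rng.sample(population, n)) - mu

def mean_abs_error(n, repeats=100):
    """Average size of the sampling error over repeated samples."""
    return mean(abs(sampling_error(n)) for _ in range(repeats))

for n in (10, 100, 1000, len(population)):
    print(n, round(mean_abs_error(n), 3))
```

The average size of the error shrinks as n increases and vanishes when the "sample" is the whole population, in agreement with the remark about a complete enumeration.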

Aside from sampling errors which arise because a sample comprises
only a portion of the population, there are errors which occur at the
stages of gathering and processing of data, regardless of whether a
sample or a complete census is taken. These errors are called
non-sampling errors. Non-sampling errors include all kinds of human
errors: faulty sampling frame, biased method of selection of units, bias in
response, non-response to mail questionnaires, errors of observation and
measurement, processing errors such as errors in editing and coding,
misclassification of observations, etc. These errors can be avoided
through the proper selection of questionnaires, following up the
non-response, proper training of the investigators, correct manipulation
of the collected information, etc.


14.1.8 Sampling Bias. In survey sampling, the word bias means
a systematic component of error which deprives a survey result of its 
representativeness. Bias is different from a random error in the sense 
that the random errors balance out in the long run while bias is 
cumulative and does not become less as the sample size increases. Bias is 
introduced by the following methods of selection: 


(i) Deliberate Selection. Bias originates from deliberate selection
which is based on personal judgement of what is representative.


(ii) Substitution. Sometimes it becomes difficult to make contact


(iii) Incomplete Coverage. Bias also emerges when we fail to cover
the whole of the selected sample. For example, we select a 


(iv) Haphazard Selection. Haphazard selection may also
introduce bias, as every human being lacks
randomness in his choices.


(v) Inadequate Interviewing. Bias enters when the
interviewing is hasty or incomplete.

In order to draw valid conclusions, the sources of sampling
bias must be avoided. This end is achieved if the sample is selected
entirely at random, a well defined concept in statistics meaning that
every unit has a known non-zero probability of being included in the
sample. It is interesting to note that in certain
types of investigations or surveys, a certain amount of
sampling bias is, however, tolerated.

14.1.9 Random Number Table. A table of random numbers
contains a series of the digits 0, 1, 2, ..., 9, produced by a random
device which ensures that

(i) each of these digits has an equal probability of 1/10 of
occurrence,

(ii) each of the pairs of digits 00, 01, ..., 99 has the same
probability of 1/100 of occurrence, and so on.

The following tables of random numbers are available:

(i) Random Sampling Numbers by Tippett, arranged in fours (Tracts
for Computers).

(ii) Random Sampling Numbers published by Kendall and Smith, which
comprise 100,000 digits printed in twos and fours (Tracts for
Computers, No. XXIV).

(iii) Random numbers published by the RAND Corporation (A Million
Random Digits).

All these tables have been adequately tested to ensure the
requirements of equal probability and independence and are
considered sufficiently random for selecting samples. The tables are
used by arbitrarily selecting a starting point.


These days, computers are programmed to generate random
numbers. Such random numbers are called pseudorandom numbers.

Table 14.1 Random Numbers

Table 14.1 is taken from Table XXXIII, Random Numbers, of Fisher and
Yates: Statistical Tables for Biological, Agricultural and Medical
Research, published by Oliver & Boyd, and is reproduced by permission of
the authors and publishers.

14.2 PROBABILITY OR RANDOM SAMPLES

A sample is called a random sample if the probability of selection for
each unit in the population is known prior to sample selection. The
important kinds of random samples, which differ in the manner in which
the sampling units are selected, are discussed in the subsections that
follow:

14.2.1 Simple Random Sample. A sample is defined to be a
simple random sample (SRS) if it is selected in such a manner that (i)
each unit in the population has an equal probability of being included in
the sample and (ii) each possible sample of the same size has an equal
probability of being the sample selected.

Suppose a finite population contains N units and a sample of n units
is to be selected. If we sample with replacement, the number of all
possible samples of size n that could be selected is $N^n$, as the first
unit of the sample can be selected in N different ways, the second unit
can also be selected in N ways and so on. When we sample without
replacement, then the number of all possible samples, when the order of
the units is considered, is the number of permutations of n units from N,

${}^{N}P_{n} = N(N-1)(N-2)\cdots(N-n+1)$

When the order is disregarded, the number of all possible samples is the
number of combinations of n units from the population of N units, i.e.,

$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$

Thus there are $\binom{N}{n}$ samples that could be selected and these
samples occur with equal probabilities.

As an illustration, suppose we wish to select random samples of size
2 from a population of, say, 5 students, identified as A, B, C, D and E. If
we sample with replacement, then there are $(5)^2 = 25$ possible samples,
which are listed below:

AA   BA   CA   DA   EA
AB   BB   CB   DB   EB
AC   BC   CC   DC   EC
AD   BD   CD   DD   ED
AE   BE   CE   DE   EE


where the first letter corresponds to the student selected on the first
draw and the second letter corresponds to the student chosen on the
second draw. Now A appears in 9 of the 25 samples, so P(A) = 9/25.
Similarly, P(B) = P(C) = P(D) = P(E) = 9/25, showing that each unit in the
population has an equal probability of being in the sample.
Furthermore, each of the 25 samples has an equal probability of 1/25 of
being selected.

When the sampling is done without replacement and the order is
disregarded, there are $\binom{5}{2} = 10$ possible distinct samples,
which are listed below:

AB, AC, AD, AE, BC, BD, BE, CD, CE, DE

Now A appears in 4 of the 10 samples, so P(A) = 4/10.
Similarly, P(B) = P(C) = P(D) = P(E) = 4/10, showing that each unit
in the population has an equal probability (4/10 = 2/5) of being in the
sample, and each of the 10 distinct samples has the same probability 1/10
of being selected. A simple random sample is also known as an
unrestricted random sample. An important advantage of simple random
sampling is that it provides unbiased estimates of the population mean,
population totals and of sampling variance of the estimates.
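The counts in the five-student illustration can be checked by brute-force enumeration. The following is a minimal sketch using only the Python standard library; the helper name `inclusion_probability` is introduced here for illustration and is not part of the text.

```python
from fractions import Fraction
from itertools import combinations, product

units = ["A", "B", "C", "D", "E"]
n = 2

# With replacement: ordered samples, N**n of them.
with_replacement = list(product(units, repeat=n))
# Without replacement, order disregarded: C(N, n) of them.
without_replacement = list(combinations(units, n))

def inclusion_probability(unit, samples):
    """Fraction of the equally likely samples that contain `unit`."""
    hits = sum(1 for sample in samples if unit in sample)
    return Fraction(hits, len(samples))

print(len(with_replacement), len(without_replacement))
for unit in units:
    print(unit,
          inclusion_probability(unit, with_replacement),
          inclusion_probability(unit, without_replacement))
```

Every unit has the same inclusion probability under each scheme, which is the first defining property of a simple random sample.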


Selection of Simple Random Sample. A simple random sample
can be selected by the following methods:

(i) Goldfish Bowl Procedure. Allot to each unit in the population a
different serial number from 1 to N and record each number on a
card or a slip of paper. Place these numbered cards or the
slips of paper in a bowl or a basket and mix them thoroughly.
Then draw out blindly the desired number of cards or the
slips of paper one by one for the sample, mixing thoroughly after
each drawing. The population units corresponding to the
numbers appearing on the selected cards or slips of paper are
then included in the sample and the desired information is
obtained. This method of random selection works well for small
- . . . . 

populations but becomes practically impossible when the 
population size is quite large or infinite. This difficulty is 
overcome by a procedure, similar to drawing slips from a basket, 
that uses a random number table. 


(ii) Using a Random Number Table. Assign a number from 1 to N to
each of the N units in the population. Consult a table of random
numbers and select randomly a starting point in the table. Read
digits in groups of two, three or more, according to the largest
number assigned to a unit in the population, from the table
vertically, horizontally or diagonally. Record each number,
discarding any number that is greater than N, and any number that
appears a second time if sampling is without replacement. Continue
this process of selection until the desired sample size is reached.


For instance, if our population consists of 100 units, we may assign
two-digit numbers from the range 00 to 99. (We start with 00 so that the
last unit is assigned 99, a two-digit number.) As the last number
assigned to the population has two digits, we read the random
numbers consisting of two digits. If, for example, we read the numbers
53, 63, 35, 98, 02, etc. in the random number table (Table 14.1, top left
hand corner, reading down the column), we would include in the random
sample those units of the population whose numbers are 2, 35, 53, 63,
98, etc.
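The table-reading rule can be sketched in a few lines. This is only an illustration: the function name `select_with_table` is hypothetical, units are assumed to be labelled 00 to N - 1 as in the 100-unit example, and the digit string simply strings together the five two-digit numbers quoted above.

```python
def select_with_table(digits, population_size, sample_size, width):
    """Read `width`-digit groups from a string of random digits.

    Labels run from 0 to population_size - 1. A group is discarded when it
    is too large or has already been chosen (sampling without replacement).
    """
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if label >= population_size or label in chosen:
            continue
        chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen

# The numbers 53, 63, 35, 98, 02 quoted in the text, as one digit string.
selected = select_with_table("5363359802", 100, 5, 2)
print(sorted(selected))
```

If the digits run out before the required sample size is reached, the function returns the shorter list, so in practice a longer stretch of the table would be supplied.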


Sampling from a given frequency distribution is performed by
assigning a sampling number to each frequency and then drawing a
random sample with the help of random numbers. Example 14.2
illustrates the procedure. In case of sampling from a given probability
distribution, we construct a cumulative probability distribution either by
calculating the probabilities associated with each value (or each class) of
the variable or from the given probabilities. We then need to assign
sampling numbers. To do this, we ignore the decimal point from the
cumulative probabilities in order to have all (N) members to choose
from. The range of sampling numbers to be used depends upon the
number of decimal places used in the calculation of probabilities. The
technique is demonstrated by Examples 14.3 and 14.4.


(iii) Using a Computer. Computer programs are available that
provide random numbers. The sampling units corresponding to
the random numbers are included in the sample.
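As one possible illustration of method (iii), Python's standard `random` module can draw the unit numbers directly. The population size, sample size and seed below are hypothetical choices made for the sketch.

```python
import random

rng = random.Random(2024)      # seeded pseudorandom generator (hypothetical seed)
N = 100                        # hypothetical population size
labels = range(1, N + 1)       # units numbered 1 to N

# Without replacement: no unit can appear twice.
sample_without = rng.sample(labels, 10)

# With replacement: a unit may be drawn more than once.
sample_with = rng.choices(labels, k=10)

print(sample_without)
print(sample_with)
```

`random.sample` corresponds to sampling without replacement and `random.choices` to sampling with replacement; both return the unit numbers to be included in the sample.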


Example 14.1. Assume that a population consists of 5 students 
and the marks obtained by them in a certain statistics class are 20, 15, 
12, 16 and 18. Draw all possible random samples of two students when 
sampling is performed (i) with replacement, (ii) without replacement. 


Calculate the mean marks for each sample. 




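The students are identified as A, B, C, D and E.

(i) The number of random samples of 2 students which can be
selected with replacement from this population is $(5)^2 = 25$. Let $X_1$
denote the marks of the student selected first and $X_2$ the marks of the
student selected second. The possible random samples of size 2 and their
mean marks ($\bar{X}$) are given below:

Sample  Mean    Sample  Mean    Sample  Mean    Sample  Mean    Sample  Mean
A, A    20      B, A    17.5    C, A    16      D, A    18      E, A    19
A, B    17.5    B, B    15      C, B    13.5    D, B    15.5    E, B    16.5
A, C    16      B, C    13.5    C, C    12      D, C    14      E, C    15
A, D    18      B, D    15.5    C, D    14      D, D    16      E, D    17
A, E    19      B, E    16.5    C, E    15      D, E    17      E, E    18

(ii) The number of random samples of 2 students that can be drawn
without replacement is $\binom{5}{2} = 10$. These samples with values of
mean marks are given below:

Sample               A, B   A, C   A, D   A, E   B, C   B, D   B, E   C, D   C, E   D, E
Mean Marks ($\bar{X}$)  17.5   16     18     19     13.5   15.5   16.5   14     15     17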
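The sample means of Example 14.1 can be generated mechanically from the five marks. The sketch below uses only the marks 20, 15, 12, 16 and 18 stated in the example; the dictionary and variable names are introduced for illustration.

```python
from itertools import combinations, product
from statistics import mean

marks = {"A": 20, "B": 15, "C": 12, "D": 16, "E": 18}

# (i) with replacement: 25 ordered samples of two students
means_with = {s: mean([marks[u] for u in s]) for s in product(marks, repeat=2)}

# (ii) without replacement, order disregarded: 10 samples
means_without = {s: mean([marks[u] for u in s]) for s in combinations(marks, 2)}

population_mean = mean(list(marks.values()))
print(len(means_with), len(means_without))
print(means_with[("A", "B")], means_without[("C", "D")])
print(population_mean,
      mean(list(means_with.values())),
      mean(list(means_without.values())))
```

Under both schemes the average of all the possible sample means comes out equal to the population mean, which illustrates the unbiasedness claimed for simple random sampling.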


Example 14.2. The following frequency distribution gives the
ages of a population of 1,000 college students:

Age (x)            14    15    16    17    18    19    20    Total
No. of Students     6    61   270   491   153    15     4    1,000

Using a random number table, select a simple random sample of 20
students. Calculate the sample mean age and compare it with the
population mean age.

First of all we assign a number to each of the 1,000 students from
the range 000 to 999. (We start with 000 so that the last student is
assigned 999, a three-digit number.) The 6 students corresponding to the
first class (x = 14) are assigned the numbers 000, 001, 002, 003, 004 and
005. The next class (x = 15) has 61 students to whom we assign the next
61 numbers 006 to 066 inclusive, and so forth. The sampling numbers
can conveniently be assigned by compiling a cumulative frequency
column. The assigned numbers are shown in column 3 in the following
table:

Age ($x_i$)   No. of Students ($f_i$)   Assigned Numbers   $f_i x_i$
14              6                       000 - 005              84
15             61                       006 - 066             915
16            270                       067 - 336            4320
17            491                       337 - 827            8347
18            153                       828 - 980            2754
19             15                       981 - 995             285
20              4                       996 - 999              80
Total       1,000                                           16785

We then use a table of random numbers to select a sample of 20
college students at random by finding 20 three-digit numbers between
000 and 999. The numbers read include
634, 982, 026, 645, 850, 585, 348, 039, 629, 084, 070, 018, 728, 887, 451,
967 and 433. Our sample students are those who correspond to these
numbers. Arranging the sample ages in a frequency distribution, we get
$\sum f_i x_i = 337$, so that

sample mean = $\frac{\sum f_i x_i}{\sum f_i} = \frac{337}{20}$ = 16.85 years, and

population mean = $\frac{\sum f_i x_i}{\sum f_i} = \frac{16785}{1000}$ = 16.78 years.

$P(X = x) = \frac{e^{-\mu}\mu^{x}}{x!}$, for x = 0, 1, 2, ...

303 (2)
869 (5)
987 (7)
249 (2)
158 (1)

Example 14.4. With the help of random numbers, draw a random
sample of size 25 from a normal distribution with $\mu = 60$ and
$\sigma = 2.5$.

As the range of a normal distribution is $\mu \pm 3\sigma$, the range for
the classes is 60 ± 3(2.5), i.e. 52.5 to 67.5. Using classes each with a
class-interval h = 2, the probabilities are calculated (see Example 9.21,
page 407 of Part I).

We assign 4-digit sampling numbers from the range 0000 to 9999 as
the probabilities are calculated to 4 decimal places (the decimal point
ignored while assigning numbers). The classes, the cumulative
probabilities and the assigned numbers are shown below:

Classes        z       P(Z < z)    Assigned Numbers
Up to 52.5    -3.0     0.0013      0000 - 0012
52.5-54.5     -2.2     0.0139      0013 - 0138
54.5-56.5     -1.4     0.0808      0139 - 0807
56.5-58.5     -0.6     0.2743      0808 - 2742
58.5-60.5      0.2     0.5793      2743 - 5792
60.5-62.5      1.0     0.8413      5793 - 8412
62.5-64.5      1.8     0.9641      8413 - 9640
64.5-66.5      2.6     0.9953      9641 - 9952
Over 66.5              1.0000      9953 - 9999

We now select a random sample of size 25 by finding 25 four-digit
numbers. For this purpose we use the four columns 11-14 of Table
14.1, page 8. The random numbers and the classes to which each sample
value corresponds are listed below (a "?" marks a digit that is not
legible):

6132 (60.5-)   7150 (60.5-)   5799 (60.5-)   1911 (56.5-)   6606 (60.5-)
9900 (64.5-)   4822 (58.5-)   3775 (58.5-)   1352 (56.5-)   2663 (56.5-)
???? (64.5-)   9110 (62.5-)   ?021 (60.5-)   2840 (58.5-)   2322 (56.5-)
??01 (60.5-)   5154 (58.5-)   4544 (58.5-)   2901 (56.5-)   6020 (60.5-)
0191 (54.5-)   6148 (60.5-)   5110 (58.5-)   2402 (56.5-)   4389 (58.5-)

We arrange this information in a table and get the frequency
distribution of the sample.
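The sampling numbers for the normal distribution with mean 60 and standard deviation 2.5 can be rebuilt from the cumulative probabilities. The sketch below relies on `statistics.NormalDist` from the Python standard library; the class boundaries are those used in the example, and the names `ranges` and `class_for` are introduced here for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=60, sigma=2.5)
upper_limits = [52.5, 54.5, 56.5, 58.5, 60.5, 62.5, 64.5, 66.5]
labels = ["up to 52.5", "52.5-54.5", "54.5-56.5", "56.5-58.5", "58.5-60.5",
          "60.5-62.5", "62.5-64.5", "64.5-66.5", "over 66.5"]

# Cumulative probabilities to 4 decimal places; the open last class ends at 1.
cum_probs = [round(dist.cdf(u), 4) for u in upper_limits] + [1.0]

# Drop the decimal point: a cumulative probability p closes its class at
# the four-digit number 10000 * p - 1.
ranges = []
start = 0
for p in cum_probs:
    end = round(p * 10_000) - 1
    ranges.append((start, end))
    start = end + 1

def class_for(number):
    """Class to which a four-digit sampling number (0-9999) belongs."""
    for label, (low, high) in zip(labels, ranges):
        if low <= number <= high:
            return label
    raise ValueError("sampling number out of range")

for label, p, (low, high) in zip(labels, cum_probs, ranges):
    print(f"{label:12s} {p:.4f}  {low:04d} - {high:04d}")
print(class_for(6132), class_for(191), class_for(9900))
```

Each random four-digit number is then translated into a class with `class_for`, exactly as the listed numbers are matched to classes in the example.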


14.2.2. Stratified Random Sample. A sample of size n is defined
to be a stratified random sample if it is selected from a population which
has been divided into a number of non-overlapping groups or
sub-populations, called strata, such that a part of the sample is drawn at
random from each stratum. Stated differently, let a
population containing N units be divided into k strata. Then the
sample of size n will be composed of the simple
random samples of predetermined sizes $n_1, n_2, \ldots, n_k$
($\sum n_i = n$) drawn independently from the strata 1, 2, ..., k.

The criteria for stratification relate to geographical characteristics
such as urban and rural; to natural characteristics
such as sex, family size, occupation, education,
etc. The advantages of stratified random
sampling are low cost, greater accuracy and a better coverage. Stratified
random sampling is used when (i) the variations among strata are
greater than the variations within strata, (ii) information about some
parts of the population is desired.

The purpose of stratified random sampling is three-fold. Firstly, the
strata obtained by subdividing the heterogeneous population into
homogeneous groups adequately represent the population, so the
information concerning each individual stratum is gathered. Secondly, it
provides improved estimates of the population characteristics. Thirdly, it
reduces the variance of the estimator.


Allocation of Sample Sizes. By allocation of a sample we mean
the way the total sample size n is distributed among the various strata
into which the population has been divided. Four methods of allocating
the sample numbers are available. They are:

(a) Equal Allocation. The allocation is called equal when from each
stratum an equal number of sampling units is selected. That is, the total
sample size n is distributed equally among all the k strata. Thus the
stratum sample size $n_i$ for equal allocation is

$n_i = \frac{n}{k}$, for i = 1, 2, ..., k.

This is the simplest method of allocation.


(b) Proportional Allocation. The allocation is said to be proportional
when the total sample size n is distributed among the different strata in
proportion to the sizes of strata. In other words, the allocation is
proportional if

$n_i = n\left(\frac{N_i}{N}\right)$, for i = 1, 2, ..., k

where $N_i$ is the population size of the ith stratum, $n_i$ is the ith
stratum sample size and N is the total size of the population. A sample of
size $n_i$ from stratum i is drawn by random numbers and investigated.
This is the next simplest method and it is the most frequently used
method. The advantage of proportional allocation is that it does not
require information either on the stratum variance or on the costs of
sampling units in different strata.
(c) Optimum Allocation. The allocation is called optimum when the
sample size n is allocated among the different strata in such a way
that, for a given cost of selecting the sample, the variance of the
estimated mean $\bar{X}_{st}$, i.e. Var($\bar{X}_{st}$), is minimized. The
stratum sample size $n_i$ for this method of allocation is

$n_i = n\,\frac{N_i\sigma_i/\sqrt{c_i}}{\sum N_i\sigma_i/\sqrt{c_i}}$

where $N_i$ is the population of the ith stratum,
$\sigma_i$ is the stratum standard deviation, and
$c_i$ is the cost of surveying one unit in the ith stratum.

This method of allocation tells us that we should select a larger size
sample from a given stratum if

(i) the stratum comprises a larger number of units, i.e. $N_i$ is larger,
(ii) the variation within the stratum is greater, i.e. $\sigma_i$ is bigger,
(iii) the sampling units in the stratum are less costly to measure, i.e.
$c_i$ is smaller.

When information on the stratum standard deviation, $\sigma_i$, is not
available, we either estimate it by $S_i$ or we may use the proportional
allocation.

(d) Neyman Allocation. This method of allocation was proposed by
Neyman (1894-1981) in 1934 and it consists of finding the $n_i$ that
minimizes the variance of the stratified sample mean for a fixed total
sample size n, ignoring the costs of surveying the units, i.e. taking the
cost per unit to be the same in every stratum. The
stratum sample size $n_i$ is given by the relation

$n_i = n\,\frac{N_i\sigma_i}{\sum N_i\sigma_i}$, for i = 1, 2, ..., k.

Neyman allocation becomes exactly the same as the proportional
allocation when all the stratum standard deviations are equal.
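The allocation formulas can be compared numerically. The sketch below implements the proportional, Neyman and optimum rules as small Python functions; the function names and the two-stratum figures (sizes, standard deviations and costs) are hypothetical and serve only to show how the rules behave. In practice the resulting $n_i$ would be rounded to whole numbers.

```python
from math import sqrt

def proportional(n, sizes):
    """n_i = n * N_i / N"""
    N = sum(sizes)
    return [n * N_i / N for N_i in sizes]

def neyman(n, sizes, sds):
    """n_i = n * N_i * sigma_i / sum(N_i * sigma_i)"""
    weights = [N_i * s_i for N_i, s_i in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

def optimum(n, sizes, sds, costs):
    """n_i = n * (N_i * sigma_i / sqrt(c_i)) / sum(N_i * sigma_i / sqrt(c_i))"""
    weights = [N_i * s_i / sqrt(c_i) for N_i, s_i, c_i in zip(sizes, sds, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: sizes 400 and 600, a total sample of 50 units.
sizes = [400, 600]
print(proportional(50, sizes))               # proportional to stratum sizes
print(neyman(50, sizes, [10, 20]))           # more units where sigma is larger
print(optimum(50, sizes, [10, 20], [4, 1]))  # fewer units where cost is higher
```

With equal standard deviations `neyman` returns the proportional allocation, and with equal costs `optimum` returns the Neyman allocation, which matches the relationships stated in the text.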


Example 14.5. Suppose a population of N = 9 is stratified into 3
strata with the following measurements:

Stratum I:    $X_{11} = 1$, $X_{12} = 2$, $X_{13} = 4$
Stratum II:   $X_{21} = 6$, $X_{22} = 8$
Stratum III:  $X_{31} = 11$, $X_{32} = 15$, ...

If two measurements are drawn from each stratum,
state how many samples of size 6 could be drawn.
List these samples and compute the means.

Here the population consists of 3 strata and from each stratum 2
units are to be selected to make up a sample of n = 6. Assuming the
sampling without replacement, we can choose $\binom{3}{2} = 3$ possible
subsamples from the first stratum, $\binom{2}{2} = 1$ possible subsample
from the second stratum and $\binom{4}{2} = 6$ possible subsamples from
the third stratum.

Each of the possible subsamples from stratum I is to be associated with
the subsample from stratum II and then these combinations are further
associated with each of the possible subsamples from stratum III. Hence,
in all there are 3 x 1 x 6, i.e. 18 possible samples of size 6 with 2
measurements from each stratum. The first sample consists of (1, 2, 6, 8,
11, 15) of which (1, 2) is from stratum I, (6, 8) is from stratum II and
(11, 15) is from stratum III. The 18 possible samples are listed below and
the sample means appear in the last column.

Sample No.   Sample data from stratum I, II, III   Sample Means
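The count of 18 possible stratified samples follows from multiplying the numbers of subsamples available in each stratum. A minimal check, assuming stratum sizes 3, 2 and 4 (which add up to N = 9) and 2 units drawn from each stratum:

```python
from math import comb, prod

stratum_sizes = [3, 2, 4]     # N_1, N_2, N_3 with N = 9
units_per_stratum = 2

# Number of possible subsamples within each stratum (without replacement).
subsamples = [comb(N_i, units_per_stratum) for N_i in stratum_sizes]

# Every subsample of one stratum can be paired with every subsample of the others.
total_samples = prod(subsamples)
print(subsamples, total_samples)
```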


Example 14.6. Select a stratified random sample of size 5 from
the following population. Find the sample mean and the
estimate of the population mean.

To select a stratified random sample using proportional
allocation, we have

$n_1 = n\,\frac{N_1}{N} = 5 \times \frac{4}{10} = 2$ and
$n_2 = n\,\frac{N_2}{N} = 5 \times \frac{6}{10} = 3$.

Using a table of random numbers, we select the following
subsamples from

Stratum I:   12, 14
Stratum II:  30, 40, 24

Now the sample mean of the sample of size n is

$\bar{x} = \frac{1}{n}\sum_{i}\sum_{j} x_{ij} = \frac{1}{5}[12 + 14 + 30 + 40 + 24] = \frac{120}{5} = 24$.

An estimator (a term to be defined later) of the population mean in
the stratified sampling, denoted by $\bar{X}_{st}$, is given by

$\bar{X}_{st} = \frac{1}{N}\sum_{i=1}^{k} N_i\bar{x}_i = \sum_{i=1}^{k} W_i\bar{x}_i$,

where $\bar{x}_i$ is the ith stratum sample mean and $W_i = N_i/N$ is
termed the weight of the ith stratum.

Hence the estimate of the population mean is

$\bar{X}_{st} = \frac{4}{10}(13) + \frac{6}{10}\left(\frac{94}{3}\right) = \frac{52}{10} + \frac{188}{10} = \frac{240}{10} = 24$.
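The weighted estimate of Example 14.6 can be verified with exact fractions. The subsample values 12, 14 and 30, 40, 24 come from the example; the stratum sizes 4 and 6 (N = 10) are an assumption read off the worked solution, since the population table itself is not legible.

```python
from fractions import Fraction

stratum_sizes = {"I": 4, "II": 6}                    # assumed N_1 and N_2
subsamples = {"I": [12, 14], "II": [30, 40, 24]}     # values drawn in the example

N = sum(stratum_sizes.values())

# Plain mean of the five sampled values.
values = [x for xs in subsamples.values() for x in xs]
sample_mean = Fraction(sum(values), len(values))

# Stratified estimate: sum of W_i * (stratum sample mean), with W_i = N_i / N.
stratified_mean = sum(
    Fraction(stratum_sizes[h], N) * Fraction(sum(xs), len(xs))
    for h, xs in subsamples.items()
)
print(sample_mean, stratified_mean)
```

Under proportional allocation the stratum weights $N_i/N$ equal the sampling fractions $n_i/n$, which is why the two figures coincide here.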


14.2.3 Systematic Random Sample. A sample of size n is
defined to be a systematic random sample if it is obtained by choosing
one unit at random from the first k units and then every kth unit
thereafter, from a population of N units numbered
from 1 to N or arranged in a systematic fashion. The letter k, called the
sampling interval, stands for some integer nearest to
N/n = (Population size)/(Sample size),
and the sample is generally expressed by saying "a 1 in k sample". For
example, when the number of the unit selected at random from the first
k units is i (i = 1, 2, ..., k), the systematic sample of size n will contain
the units with numbers i, i + k, i + 2k, ..., i + (n - 1)k. Suppose that
i = 7 and k = 20, then a systematic sample consisting of every 20th unit
will be composed of the units numbered 7, 27, 47, 67 and so on. Thus in
systematic sampling, sampling units are selected at a uniform interval
after a random start.
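The selection rule i, i + k, i + 2k, ..., i + (n - 1)k translates directly into code. The function names below are hypothetical, and the population of 1,000 units with a sample of 50 is chosen only so that the interval comes out as k = 20.

```python
import random

def systematic_units(start, k, n):
    """Unit numbers start, start + k, ..., start + (n - 1) * k."""
    return [start + j * k for j in range(n)]

def systematic_sample(N, n, rng):
    """Draw a '1 in k' sample from units numbered 1 to N.

    k is taken as the integer nearest to N / n (Python's round sends exact
    halves to the even integer); when k is rounded upwards the last few
    unit numbers may exceed N and would have to be dropped.
    """
    k = round(N / n)
    start = rng.randint(1, k)        # random start among the first k units
    return systematic_units(start, k, n)

print(systematic_units(7, 20, 4))    # the i = 7, k = 20 case from the text
sample = systematic_sample(1000, 50, random.Random(0))
print(sample[:5])
```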


When the sampling interval corresponds to some periodic or cyclic
characteristic in the population, the systematic sampling will result in a
non-representative sample. For example, suppose every 20th shop in a
big bazaar is a corner shop and the sampling interval is also 20. If the
random start coincides with a corner shop, then the sample will include
all the corner shops, and the sample will be highly non-representative as
different characteristics (say, dealing in beverages) are associated with
the corner shops. One way of avoiding this sort of periodicity is to take
a fractional sampling interval and then to round off the numbers
obtained, e.g. if the sampling interval is 20.7 and we start at 7, then the
subsequent numbers are 27.7, 48.4, 69.1, 89.8, etc. and the rounding off
results in 28, 48, 69, 90, etc. The sample will then consist of the units
with serial numbers 7, 28, 48, 69, 90, etc.
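The fractional-interval device can be sketched as follows. The helper name is hypothetical; the start 7 and the interval 20.7 are the figures used in the text, and rounding is done to the nearest whole number.

```python
from math import floor

def fractional_systematic(start, interval, count):
    """Round start + j * interval to the nearest integer for j = 0, 1, ..."""
    return [floor(start + j * interval + 0.5) for j in range(count)]

units = fractional_systematic(7, 20.7, 5)
print(units)
```

Because the gaps between successive units alternate between 20 and 21, the selected units no longer fall on every 20th position, which breaks the alignment with a 20-unit cycle.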


If we think of the population as being divided into n strata, each
consisting of k units, then a systematic sample resembles a stratified
random sample with one unit from each stratum. The advantages of
systematic sampling are that it saves much time and effort, it is
economical, it is easily selected and conveniently worked out.
Furthermore, a systematic sample is more representative of the
population sampled as the sample is more evenly distributed across the
population.


14.2.4. Cluster Sample. A random sample is said to be a cluster
sample if it consists of selecting at random groups of individuals, called
clusters (treated as sampling units), into which a population is
divided, and then including in the sample either all the units from
the selected clusters, or a random sample of the units selected from
them. In other words, suppose that a population is
first divided into M smaller groups (equal or unequal in size), called
clusters, such as street blocks of the cities, households, classes, etc. A
random sample consisting of a number of clusters (say, m) is selected
from these clusters, where each chosen cluster is either
subsampled or all the subunits in the sampled clusters are included in
the sample. Such a sample is called a cluster sample. Cluster sampling
requires that the clusters should be as internally dissimilar as possible
and different clusters should be very similar.

The procedure is called one-stage cluster sampling when all the
subunits that each of the sampled clusters comprises are included. If each
of the sampled clusters is subsampled, then the sampling plan is called
two-stage sampling or subsampling. The plan is called multistage
sampling when more than two stages are involved in obtaining the
sample. When the clusters relate to geographical areas, the sampling is
known as area sampling.
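The difference between one-stage cluster sampling and two-stage sampling can be sketched with a small, entirely hypothetical population of four street blocks; the function names and the data below are assumptions made for illustration.

```python
import random

def one_stage_cluster_sample(clusters, m, rng):
    """Select m clusters at random and keep every subunit in them."""
    chosen = rng.sample(sorted(clusters), m)
    return {name: list(clusters[name]) for name in chosen}

def two_stage_cluster_sample(clusters, m, per_cluster, rng):
    """Select m clusters at random, then subsample each chosen cluster."""
    chosen = rng.sample(sorted(clusters), m)
    return {
        name: rng.sample(clusters[name], min(per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical clusters (psu) with their households (ssu).
clusters = {
    "block 1": [1, 2, 3],
    "block 2": [4, 5],
    "block 3": [6, 7, 8, 9],
    "block 4": [10],
}
rng = random.Random(7)
one_stage = one_stage_cluster_sample(clusters, 2, rng)
two_stage = two_stage_cluster_sample(clusters, 2, 2, rng)
print(one_stage)
print(two_stage)
```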


As each cluster is treated as a single sampling unit in the selection
process, the clusters are therefore called the primary sampling units
(psu), while the subunits composing a cluster are called the secondary
sampling units (ssu).


The advantage of cluster sampling is savings in cost and time, as
the cost of sample selection and travel expenses of interviewers are
considerably low. Cluster sampling is used when (i) the sampling frames
of adequate coverage are not available, (ii) the variations among clusters
are smaller than the variations within clusters. It is to be noted that
cluster sampling is mostly used in statistical quality control.

14.2.5. Multistage Sample. A sample is called a multistage
sample when it is selected in stages, the sampling units at each stage
being subsampled from the larger units selected at the previous stage.
Here a population is divided into a number of units, called first-stage
units, which are subsampled. Each of the selected second-stage units is
further divided into third-stage units, from which a subsample is
selected and so on. For example, in a sample survey, we select a random
sample of districts (first-stage units), then we take a subsample of
villages (second-stage units) from each of the selected districts, again
select a sub-sample of households (third-stage units) from the
selected villages and so on.


Two-phase sampling, also called double sampling, was introduced
for the first time by Neyman. When more phases are added, it becomes
multi-phase sampling. It is important to note that in multi-phase
sampling, the same units are used at each phase, whereas in multi-stage
sampling, the units are different at different stages of sampling. The
advantages of multi-phase sampling are that it can prove cheaper and it
reduces the burden on respondents.

14.2.7. Sequential Sampling. This is another method of
sampling where the sample size is not fixed in advance but sampling
units are drawn one by one or in lots, and the decision is based on a
definite rule relating to the sampling units themselves. That is, we draw
one unit at a time and after each drawing, we make a decision whether
to accept the lot or group, whether to reject it or whether to continue
sampling. A graphic or tabular procedure is generally used to find when
sampling should terminate. This technique was developed during World
War II by Abraham Wald (1902-50) and is good for reaching decisions
rapidly.


14.3 NON-PROBABILITY SAMPLE


A non-probability sample or a non-random sample is also called a
judgement sample. The important types of a non-random sample are the
purposive sample and the quota sample. They are briefly described in the
subsections that follow.


14.3.1. Purposive Sample. A purposive sample is a non-random
sample in which the selection of the sampling units is based on a
person's expertise about the population. A purposive sample is liable to
bias introduced by the deliberate subjective choice of the person
who selects the sample. As purposive sampling is not based on
probability theory, there is no objective method for measuring
the reliability of the sample results, and hence the information gathered
from such a sample cannot be made a basis for statistical inference.

Purposive sampling, in spite of these obvious drawbacks, is in
several situations preferred over probability sampling and gives quite
satisfactory results. For instance, when taking a sample of melons from a
truck-load, the sampler inspects the whole load and then selects,
according to his expert subjective judgement, those melons which he
considers to be representative. Purposive sampling can also be
appropriate when the population contains relatively few large units whose
characteristics are known. Its main use, however, is in Economic and
Business statistics.


14.3.2. Quota Sample. A quota sample is a type of judgement
sample. It is a sample, usually of human beings, in which the information
is collected from the segments of a population (the quotas),
e.g. the quotas of men and women; urban and rural; upper, middle and
lower income groups, etc. These factors are termed quota controls. They
are intended to make the sample as representative as possible and thus
reduce the bias that creeps in because the selection of respondents
depends on the personal choice of the interviewers who,
being human, are likely to look for persons who either
share similar opinions or are personally known to them or are
conveniently located.

Quota sampling may be considered as stratified sampling in which
the selection of units within strata is non-random. The advantages of
quota sampling are that it is cheaper, it is easy administratively and it is
a very quick form of investigation. Quota sampling is widely used in
public opinion polls and market research surveys.

14.4 SAMPLING DISTRIBUTIONS


A sampling distribution is defined as a probability distribution of
the values of a statistic such as a mean, a standard deviation, a
proportion, etc., computed from all possible samples of the same size,
which might be selected with or without replacement from a population.
As a sampling distribution of a statistic is a probability distribution,
the sum of all probabilities in it is always equal to one; and the
distribution has its own mean and its own standard deviation. The
values of the statistic computed from one or more samples actually
selected from the population and the sampling distribution together
provide the information about the values of the population parameters.
There are many sampling distributions, but the most frequently used
types in statistical inference are the binomial, the normal, the
t-distribution, the chi-square distribution and the F distribution. A
sampling distribution should not be confused with a sample distribution,
which is the distribution of individual values of a single sample.




Standard Error. The standard deviation of a sampling distribution
of a sample statistic is called the standard error (abbreviated to S.E.) of
the statistic. The standard error thus measures the dispersion of the
values of a statistic that might be computed from all possible samples,
whereas the standard deviation of a population (or sample) measures the
dispersion of the values of the population (sample) units about the
population (sample) mean.


14.4.1. Sampling Distribution of the Mean. The sampling
distribution of the mean is the probability distribution or the relative
frequency distribution of the means $\bar{X}$ of all possible random samples of
the same size that could be selected from a given population. The mean
of this distribution is represented by $\mu_{\bar{X}}$ and the standard deviation,
which is called the standard error of the mean, by $\sigma_{\bar{X}}$ or S.E.($\bar{X}$). The
value $\sigma_{\bar{X}}$ indicates the spread in the distribution of all possible sample
means.

The sampling distribution of $\bar{X}$ has the following properties.

(i) The mean of the sampling distribution of the mean (equivalently,
the mean of all possible sample means) is equal to the population mean,
that is $\mu_{\bar{X}} = \mu$, regardless of whether sampling is done with replacement
or without replacement.


Proof. Let us first consider sampling without replacement from a
finite population of size N. The number of distinct simple random
samples of size n that can be selected without replacement from a
population of size N is $\binom{N}{n} = k$, say. Let $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_i, \ldots, \bar{X}_k$ be the
means of the $k = \binom{N}{n}$ possible random samples of size n, where $\bar{X}_i$ is the
mean of the ith sample. Then the mean of the sampling distribution of $\bar{X}$
(equivalently, the mean of all possible sample means), denoted by $\mu_{\bar{X}}$, is

$$\mu_{\bar{X}} = \frac{1}{k}\sum_{i=1}^{k}\bar{X}_i = \frac{1}{\binom{N}{n}}\left[\frac{X_1+X_2+\cdots}{n} + \frac{X_1+X_3+\cdots}{n} + \cdots + \frac{X_2+X_3+\cdots}{n}\right].$$

In order to simplify the expression on the right, we find out the
number of samples that contain any specified value $X_i$. The number of
such samples is $\binom{N-1}{n-1}$, that is, the number of ways in which the (n-1)
other units in the sample are to be selected from the remaining (N-1)
units.

Next, we determine the coefficient of the value $X_i$ by collecting all
the terms in the sum containing $X_i$. Thus the coefficient of each $X_i$ is

$$\frac{\binom{N-1}{n-1}}{n\binom{N}{n}} = \frac{(N-1)!}{(n-1)!\,(N-n)!}\cdot\frac{n!\,(N-n)!}{n\,N!} = \frac{1}{N}.$$

Hence

$$\mu_{\bar{X}} = \frac{X_1}{N} + \frac{X_2}{N} + \cdots + \frac{X_N}{N} = \frac{X_1 + X_2 + \cdots + X_i + \cdots + X_N}{N} = \mu, \text{ the mean of the population.}$$


Sampling with replacement. Let $X_1, X_2, \ldots, X_n$ be the
observations of a simple random sample of size n from a population
having N observations. Then a specified $X_i$ taken from the population
could be any one of the N values with an equal probability of $\frac{1}{N}$, as all the
values are equally likely. Thus $X_i$ is a random variable and therefore

$$E(X_i) = \frac{1}{N}\sum_{j=1}^{N} x_j = \mu.$$

For repeated sampling, the mean of a sample $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ varies
from sample to sample, therefore

$$E(\bar{X}) = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\left[E(X_1) + E(X_2) + \cdots + E(X_n)\right] = \frac{1}{n}\left[\mu + \mu + \cdots + \mu\right] = \frac{1}{n}[n\mu] = \mu, \text{ the population mean.}$$
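Property (i) can be checked numerically by listing every possible sample. The sketch below is only an illustration: the population values and the helper name `mean_of_sample_means` are hypothetical choices, not taken from the text. It averages the means of all samples of size 2, drawn first without and then with replacement, using exact fractions.

```python
from fractions import Fraction
from itertools import combinations, product


def mean_of_sample_means(population, n, replace=False):
    """Average the means of all possible samples of size n (exact arithmetic)."""
    draws = product(population, repeat=n) if replace else combinations(population, n)
    means = [Fraction(sum(sample), n) for sample in draws]
    return sum(means) / len(means)


population = [2, 5, 7, 10]  # hypothetical values, chosen only for illustration
mu = Fraction(sum(population), len(population))

# Both comparisons print True: the mean of the sample means equals mu.
print(mean_of_sample_means(population, 2) == mu)
print(mean_of_sample_means(population, 2, replace=True) == mu)
```

Exact fractions are used so that the equality holds without rounding error.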


(ii) The standard deviation of the sampling distribution of the mean
is given by

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}, \qquad (\sigma = \text{population s.d.})$$

when sampling is performed without replacement from a finite
population of size N, or

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

when sampling is done with replacement from a finite population
or sampling from an infinite population.

Proof. The variance of $\bar{X}$, denoted by $\sigma_{\bar{X}}^2$ or Var($\bar{X}$), is defined as

$$\begin{aligned}
\mathrm{Var}(\bar{X}) &= E[\bar{X} - E(\bar{X})]^2 = E[\bar{X} - \mu]^2 \\
&= E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)\right]^2 \\
&= \frac{1}{n^2}E\left[\sum_{i=1}^{n}(X_i - \mu)^2 + \sum_{i \neq j}(X_i - \mu)(X_j - \mu)\right] \\
&= \frac{1}{n^2}E\left[\sum_{i=1}^{n}(X_i - \mu)^2\right] + \frac{1}{n^2}E\left[\sum_{i \neq j}(X_i - \mu)(X_j - \mu)\right].
\end{aligned}$$

The simplification depends on whether the sampling is performed
without replacement from a finite population of size N or sampling is
done with replacement. The two cases are treated separately.


First case: Sampling without replacement.
Since the probability of obtaining $(X_i - \mu)^2$ on the ith draw is equal
to the probability of obtaining $X_i$ on the ith draw, which is $\frac{1}{N}$, therefore
the expected value of $(X_i - \mu)^2$ becomes $\sigma^2$, i.e.

$$E(X_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2 = \sigma^2.$$

Again, since the sampling is without replacement, the probability of
selecting $(X_i - \mu)(X_j - \mu)$ on the ith and jth draw is $\frac{1}{N}\cdot\frac{1}{N-1}$, because
they are not independent on account of the reduction in size from N to
N-1. Thus

$$\begin{aligned}
E(X_i - \mu)(X_j - \mu) &= \frac{1}{N(N-1)}\sum_{i \neq j}(X_i - \mu)(X_j - \mu) \\
&= \frac{1}{N(N-1)}\left\{\left[\sum_{i=1}^{N}(X_i - \mu)\right]^2 - \sum_{i=1}^{N}(X_i - \mu)^2\right\} = \frac{-N\sigma^2}{N(N-1)} = -\frac{\sigma^2}{N-1}.
\end{aligned}$$

Substituting these values, we get

$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\left[\sum_{i=1}^{n}\sigma^2 + \sum_{i \neq j}\left(-\frac{\sigma^2}{N-1}\right)\right] = \frac{1}{n^2}\left[n\sigma^2 - \frac{n(n-1)\sigma^2}{N-1}\right] = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$

Hence

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}.$$

The factor $\frac{N-n}{N-1}$ is usually called the finite population correction
(fpc) or finite population factor for the variance, because in sampling
from a finite population the variance of the mean is reduced by
this factor. In sampling without replacement from a finite population of
size N, the fpc is dropped from the formula whenever n, the sample size,
is less than 5% of N; and the fpc is used when n is 5% of N or greater.


Second case: Sampling with replacement.

When sampling is done with replacement or sampling from an
infinite population, the $X_i$ and $X_j$ are statistically independent. Therefore
$E(X_i - \mu)(X_j - \mu) = 0$. Hence we get

$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}, \quad\text{or}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{Population standard deviation}}{\text{Square root of sample size}}.$$

It is to be noted that the standard error of the mean is always less
than the standard deviation of the population whenever the sample size
exceeds one. This means that the
sampling distribution of the mean has less variability than the
population from which the samples were taken. If the value of $\sigma$ is not
known and if the sample size is large (as a rule of thumb adopted by
many authors, a sample containing 30 or more observations constitutes a
large or sufficiently large sample), it is replaced by s, the standard
deviation of the sample. The S.E. of the mean then becomes

$$s_{\bar{X}} = \frac{s}{\sqrt{n}}.$$
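The two formulas for the standard error, together with the 5% rule for the finite population correction, can be collected into one small function. This is a minimal sketch: the function name `standard_error` and its `threshold` parameter are illustrative assumptions, and passing `N=None` stands for sampling with replacement or from an infinite population.

```python
import math


def standard_error(sigma, n, N=None, threshold=0.05):
    """Standard error of the mean.

    sigma: population standard deviation; n: sample size;
    N: population size for sampling without replacement (None otherwise).
    The fpc is applied only when n is at least `threshold` of N.
    """
    se = sigma / math.sqrt(n)
    if N is not None and n / N >= threshold:
        se *= math.sqrt((N - n) / (N - 1))  # finite population correction
    return se


# n = 36 is under 5% of N = 1500, so the fpc is dropped.
print(round(standard_error(0.048, 36, N=1500), 4))
# n = 100 is over 5% of N = 310, so the fpc is applied.
print(round(standard_error(5000, 100, N=310), 2))
```

The two calls use the figures of Examples 14.10 and 14.11 of this section; the second differs from the text's Rs. 412.20 only in the last decimal place because the text rounds intermediate values.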
(iii) Shape of the distribution. (a) If the population sampled is
normally distributed, then the sampling distribution of the mean $\bar{X}$ will
also be normal regardless of sample size.


To prove this, we proceed as follows: 


By definition, the moment generating function of $\bar{X}$ is

$$M_{\bar{X}}(t) = E\left(e^{t\bar{X}}\right) = E\left(e^{\frac{t}{n}\sum_{i=1}^{n}X_i}\right) = E\left[\prod_{i=1}^{n}e^{tX_i/n}\right] = \prod_{i=1}^{n}E\left(e^{tX_i/n}\right).$$

But $E\left(e^{tX_i/n}\right) = M_{X_i}\left(\frac{t}{n}\right)$.

If X is $N(\mu, \sigma^2)$, then

$$M_X(t) = e^{\mu t + \sigma^2 t^2/2}, \quad\text{and}\quad M_X\left(\frac{t}{n}\right) = e^{\mu t/n + \frac{1}{2}\sigma^2 (t/n)^2}.$$

Since $X_1, X_2, \ldots, X_n$ is a random sample, therefore

$$M_{\bar{X}}(t) = E\left(e^{t\bar{X}}\right) = \left[M_X\left(\frac{t}{n}\right)\right]^n = \left[e^{\mu t/n + \frac{1}{2}\sigma^2 t^2/n^2}\right]^n = e^{\mu t + \frac{1}{2}\sigma^2 t^2/n}.$$

But this is the m.g.f. of a normal distribution with mean = $\mu$ and
variance = $\sigma^2/n$. Thus $\bar{X}$ is a normally distributed variable with mean $\mu$ and
variance $\sigma^2/n$, where $\mu$ and $\sigma^2$ are the mean and variance of the
population.

(b) If the population sampled is non-normal, then for sufficiently
large sample size, the sampling distribution of $\bar{X}$ will
approximate the normal distribution.

This is a special case of the most important statistical theorem,
known as the Central Limit Theorem, which is stated and proved in the
next section.

We know that the standardized form of a random variable is
obtained by subtracting its mean from it and dividing the difference by
its standard deviation, that is

$$Z = \frac{\text{value of random variable} - \text{mean of random variable}}{\text{standard deviation of random variable}}.$$

We have proved above that the sample mean $\bar{X}$ is a normally distributed
random variable with mean equal to the population mean $\mu$ and standard
deviation $\frac{\sigma}{\sqrt{n}}$. The standard normal variable then becomes

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.$$

If the sampling is done without replacement and the sample size n is 5% or
more of the population size N, then Z values are obtained by
the formula

$$Z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}}.$$

The sampling distribution of $\bar{X}$ thus offers solutions to probability
questions about the values of the sample means.
Example 14.7. Assume that a population consists of 7 similar
containers having the following weights (kilograms):

9.8, 10.2, 10.4, 9.8, 10.0, 10.2 and 9.6.

(a) Find the mean $\mu$ and the standard deviation $\sigma$ of the given
population.
(b) Draw random samples of 2 containers without replacement and
calculate the mean weight $\bar{X}$ of each sample.
(c) Form a frequency distribution of $\bar{X}$ and a sampling distribution
of $\bar{X}$.
(d) Find the mean and the standard deviation of the sampling
distribution of $\bar{X}$.

(a) The population mean $\mu$ and standard deviation $\sigma$ are

$$\mu = \frac{\sum X}{N} = \frac{9.8 + 10.2 + \cdots + 9.6}{7} = \frac{70.0}{7} = 10.0 \text{ kg; and}$$

$$\sigma = \sqrt{\frac{\sum (X-\mu)^2}{N}} = \sqrt{\frac{(9.8-10)^2 + (10.2-10)^2 + \cdots + (9.6-10)^2}{7}} = \sqrt{\frac{0.48}{7}} = \sqrt{0.0686} = 0.262 \text{ kg.}$$

(b) Let the containers be identified as A, B, C, D, E, F and G. Now
the number of possible random samples of n = 2 containers
without replacement is $\binom{7}{2} = 21$. The 21 possible random
samples with the values of their mean weights are given below:

Sample    Weights        Mean
A, B      9.8, 10.2      10.0
A, C      9.8, 10.4      10.1
A, D      9.8, 9.8        9.8
A, E      9.8, 10.0       9.9
A, F      9.8, 10.2      10.0
A, G      9.8, 9.6        9.7
B, C      10.2, 10.4     10.3
B, D      10.2, 9.8      10.0
B, E      10.2, 10.0     10.1
B, F      10.2, 10.2     10.2
B, G      10.2, 9.6       9.9
C, D      10.4, 9.8      10.1
C, E      10.4, 10.0     10.2
C, F      10.4, 10.2     10.3
C, G      10.4, 9.6      10.0
D, E      9.8, 10.0       9.9
D, F      9.8, 10.2      10.0
D, G      9.8, 9.6        9.7
E, F      10.0, 10.2     10.1
E, G      10.0, 9.6       9.8
F, G      10.2, 9.6       9.9

(c) The frequency distribution of $\bar{x}$ and the sampling distribution of
the mean $\bar{X}$, which is just the relative frequency distribution of $\bar{x}$,
are obtained below:

$\bar{x}$      Frequency f    Probability $f(\bar{x})$
9.7     2              2/21
9.8     2              2/21
9.9     4              4/21
10.0    5              5/21
10.1    4              4/21
10.2    2              2/21
10.3    2              2/21
Total   21             1

(d) The mean and standard deviation of the sampling distribution of $\bar{X}$
are computed below:

$\bar{x}$      $f(\bar{x})$    $\bar{x}\,f(\bar{x})$    $\bar{x} - \mu_{\bar{X}}$    $(\bar{x} - \mu_{\bar{X}})^2 f(\bar{x})$
9.7     2/21    19.4/21     -0.3      0.18/21
9.8     2/21    19.6/21     -0.2      0.08/21
9.9     4/21    39.6/21     -0.1      0.04/21
10.0    5/21    50.0/21      0        0
10.1    4/21    40.4/21      0.1      0.04/21
10.2    2/21    20.4/21      0.2      0.08/21
10.3    2/21    20.6/21      0.3      0.18/21
Total   1       210.0/21              0.60/21

$$\mu_{\bar{X}} = \sum \bar{x}\,f(\bar{x}) = \frac{210.0}{21} = 10.0 \text{ kg, and}$$

$$\sigma_{\bar{X}} = \sqrt{\sum (\bar{x} - \mu_{\bar{X}})^2 f(\bar{x})} = \sqrt{\frac{0.60}{21}} = \sqrt{0.0286} = 0.17 \text{ kg,}$$

which is a smaller value than $\sigma$ = 0.262 kg, indicating that the sampling
distribution of the mean is more concentrated about the population mean.
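The enumeration in Example 14.7 can be reproduced mechanically. The sketch below assumes the first weight is 9.8 kg, which is the value consistent with the total of 70.0 used in part (a); the variable names are illustrative. It lists all 21 samples of size 2, tabulates the sample means, and compares the variance of the sample means with $\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}$.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Container weights in kg (first value assumed to be 9.8; see the lead-in).
weights = [Fraction(w) for w in ("9.8", "10.2", "10.4", "9.8", "10.0", "10.2", "9.6")]
N, n = len(weights), 2

mu = sum(weights) / N
var = sum((w - mu) ** 2 for w in weights) / N          # population variance

means = [sum(sample) / n for sample in combinations(weights, n)]
dist = Counter(means)                                  # frequency of each sample mean
mu_xbar = sum(means) / len(means)
var_xbar = sum((m - mu_xbar) ** 2 for m in means) / len(means)

for value in sorted(dist):
    print(float(value), dist[value])                   # sample mean and its frequency

# Exact check of the finite-population formula for the variance of the mean.
print(var_xbar == var / n * Fraction(N - n, N - 1))
print(round(math.sqrt(var_xbar), 2))                   # standard error in kg
```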


Example 14.8. A sample of size n=3 is to be randomly selected
without replacement from a population that has N=5 items whose values
are 0, 3, 6, 9 and 12.

(a) Find the sampling distribution of the sample mean, $\bar{X}$.

(b) Calculate the mean and the standard deviation of $\bar{X}$, and verify
that

$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$

Let the items be designated by the letters A, B, C, D and E.

(a) The number of samples of size n=3 that could be drawn without
replacement from a population of size N=5 is $\binom{5}{3} = 10$.




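The 10 possible samples and their means are given below:

Sample No.   Sample combination   Sample values   Mean $\bar{x}$
1            A, B, C              0, 3, 6         3
2            A, B, D              0, 3, 9         4
3            A, B, E              0, 3, 12        5
4            A, C, D              0, 6, 9         5
5            A, C, E              0, 6, 12        6
6            A, D, E              0, 9, 12        7
7            B, C, D              3, 6, 9         6
8            B, C, E              3, 6, 12        7
9            B, D, E              3, 9, 12        8
10           C, D, E              6, 9, 12        9

The sampling distribution is obtained by listing all possible means
and their probabilities (relative frequencies) as below:

Sampling Distribution of $\bar{X}$

$\bar{x}$      Number of sample means (f)   Probability $f(\bar{x})$
3       1                            1/10
4       1                            1/10
5       2                            2/10
6       2                            2/10
7       2                            2/10
8       1                            1/10
9       1                            1/10
Total   10                           1

(b) Now we calculate the mean and the standard deviation (the
standard error) of the sampling distribution of the mean as follows:

Calculation of Mean and S.D. of Sampling Distribution of $\bar{X}$

$\bar{x}$      $f(\bar{x})$    $\bar{x}\,f(\bar{x})$    $\bar{x}^2 f(\bar{x})$
3       1/10    3/10       9/10
4       1/10    4/10       16/10
5       2/10    10/10      50/10
6       2/10    12/10      72/10
7       2/10    14/10      98/10
8       1/10    8/10       64/10
9       1/10    9/10       81/10
Total   1       60/10      390/10

Now $\mu_{\bar{X}} = \sum \bar{x}\,f(\bar{x}) = \frac{60}{10} = 6$, and

$$\sigma_{\bar{X}} = \sqrt{\sum \bar{x}^2 f(\bar{x}) - \left[\sum \bar{x}\,f(\bar{x})\right]^2} = \sqrt{\frac{390}{10} - (6)^2} = \sqrt{39 - 36} = \sqrt{3} = 1.732.$$

In order to verify the given result, we first calculate the mean $\mu$ and
the variance $\sigma^2$ of the given population. Thus

$$\mu = \frac{1}{5}[0 + 3 + 6 + 9 + 12] = \frac{1}{5}[30] = 6, \text{ and}$$

$$\sigma^2 = \frac{1}{5}\left[(0-6)^2 + (3-6)^2 + (6-6)^2 + (9-6)^2 + (12-6)^2\right] = 18.$$

Verification: Now

$$\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} = \frac{18}{3}\cdot\frac{5-3}{5-1} = \frac{18 \times 2}{3 \times 4} = 3 = \sigma_{\bar{X}}^2.$$

Hence the result.


Example 14.9. Suppose that a random variable X has the
following population distribution:

x       3     6     9
f(x)    1/3   1/3   1/3

If a sample of three numbers is taken with replacement, obtain the
sampling distribution of the sample mean and verify that $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.

The population distribution of the r.v. X may be written as

$$f(x) = P(X = x) = \begin{cases} 1/3 & \text{for } x = 3 \\ 1/3 & \text{for } x = 6 \\ 1/3 & \text{for } x = 9 \end{cases}$$

This means that the members of the population have the numerical
values of 3, 6 and 9.

The number of possible samples of size n=3, which could be selected
with replacement from this population is $3^3 = 27$. The 27 random samples
with their means are given below:

No.  Sample   $\bar{x}$     No.  Sample   $\bar{x}$     No.  Sample   $\bar{x}$
1    3, 3, 3  3      10   6, 3, 3  4      19   9, 3, 3  5
2    3, 3, 6  4      11   6, 3, 6  5      20   9, 3, 6  6
3    3, 3, 9  5      12   6, 3, 9  6      21   9, 3, 9  7
4    3, 6, 3  4      13   6, 6, 3  5      22   9, 6, 3  6
5    3, 6, 6  5      14   6, 6, 6  6      23   9, 6, 6  7
6    3, 6, 9  6      15   6, 6, 9  7      24   9, 6, 9  8
7    3, 9, 3  5      16   6, 9, 3  6      25   9, 9, 3  7
8    3, 9, 6  6      17   6, 9, 6  7      26   9, 9, 6  8
9    3, 9, 9  7      18   6, 9, 9  8      27   9, 9, 9  9

The sampling distribution of the sample mean $\bar{X}$ is obtained below
together with two columns needed for the calculation of the S.E. of the
mean:

Sampling Distribution of $\bar{X}$ and Calculation of S.E.($\bar{X}$)

$\bar{x}$      No. of sample means (f)   $f(\bar{x})$    $\bar{x}\,f(\bar{x})$    $\bar{x}^2 f(\bar{x})$
3       1                         1/27    3/27       9/27
4       3                         3/27    12/27      48/27
5       6                         6/27    30/27      150/27
6       7                         7/27    42/27      252/27
7       6                         6/27    42/27      294/27
8       3                         3/27    24/27      192/27
9       1                         1/27    9/27       81/27
Total   27                        1       162/27     1026/27

Now, $\mu_{\bar{X}} = \sum \bar{x}\,f(\bar{x}) = \frac{162}{27} = 6$,

$$\sigma_{\bar{X}}^2 = \sum \bar{x}^2 f(\bar{x}) - \left[\sum \bar{x}\,f(\bar{x})\right]^2 = \frac{1026}{27} - (6)^2 = 38 - 36 = 2.$$

And, $\mu = \sum x\,f(x) = 3 \times \frac{1}{3} + 6 \times \frac{1}{3} + 9 \times \frac{1}{3} = 6$,

$$\sigma^2 = \sum x^2 f(x) - \mu^2 = \left[9 \times \tfrac{1}{3} + 36 \times \tfrac{1}{3} + 81 \times \tfrac{1}{3}\right] - (6)^2 = (3 + 12 + 27) - 36 = 6.$$

Verification: $\frac{\sigma^2}{n} = \frac{6}{3} = 2 = \sigma_{\bar{X}}^2$.

Hence, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.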
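For sampling with replacement, as in Example 14.9, the same check can be carried out with `itertools.product`, which generates all $3^3 = 27$ ordered samples. This is a sketch for illustration; the variable names are not from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

values = [3, 6, 9]      # equally likely population values of Example 14.9
n = 3

samples = list(product(values, repeat=n))              # ordered samples, with replacement
means = [Fraction(sum(s), n) for s in samples]
freq = Counter(means)

mu_xbar = sum(means) / len(means)
var_xbar = sum(m ** 2 for m in means) / len(means) - mu_xbar ** 2

mu = Fraction(sum(values), len(values))
var = Fraction(sum(v ** 2 for v in values), len(values)) - mu ** 2

print(len(samples))                                    # number of possible samples
print([(int(m), freq[m]) for m in sorted(freq)])       # sampling distribution as (mean, f)
print(mu_xbar == mu, var_xbar == var / n)              # both comparisons print True
```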


Example 14.10. The weights of 1500 ball bearings are normally
distributed with a mean of 22.40 ounces and a standard deviation of
0.048 ounces. If 300 random samples of size 36 are drawn from this
population, (a) determine the expected mean and standard deviation of
the sampling distribution of mean if sampling is done (i) with
replacement, (ii) without replacement; (b) how many of the random
samples would have their means between 22.39 and 22.42 oz?
(P.U. B.A./B.Sc., 1971)

(a) There would be $(1500)^{36}$ and $\binom{1500}{36}$ possible samples of size 36
that could be obtained theoretically from a population of weights of 1500
ball bearings with and without replacement respectively. Obviously the
number of theoretically possible samples is much larger than 300.
Therefore the sampling distribution of the mean will not be a true
sampling distribution. (Such a distribution is called an experimental
sampling distribution.) But 300 being a large number, there should be a
close agreement between the experimental sampling distribution of 300
sample means and the true sampling distribution of the mean. Hence the
expected mean and standard deviation are found to be as follows:

(i) Sampling with replacement:

$$\mu_{\bar{X}} = \mu = 22.40 \text{ oz.,} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.048}{\sqrt{36}} = 0.008 \text{ oz.}$$

(ii) Sampling without replacement:

$$\mu_{\bar{X}} = \mu = 22.40 \text{ oz.,} \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}.$$

Since the sample size n=36 is less than 5% of the population size
N=1500, therefore according to the generally accepted rule for the use of
fpc, the factor $\frac{N-n}{N-1}$ is dropped. Thus

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.048}{\sqrt{36}} = 0.008 \text{ oz.}$$

(b) The sampling distribution of the mean $\bar{X}$ is normal because the
population sampled is normally distributed. Thus

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - 22.40}{0.008} \text{ is a standard normal variable.}$$

To find the proportion of sample means lying between 22.39 and 22.42,
we transform these values to the corresponding z-values. Thus

at $\bar{x}$ = 22.39, we find $z_1 = \frac{22.39 - 22.40}{0.008} = -1.25$, and

at $\bar{x}$ = 22.42, we find $z_2 = \frac{22.42 - 22.40}{0.008} = 2.50$.

Using the Table of areas under the normal curve, we find

$$\begin{aligned}
P(22.39 \le \bar{X} \le 22.42) &= P(-1.25 \le Z \le 2.50) \\
&= P(-1.25 \le Z \le 0) + P(0 \le Z \le 2.50) \\
&= 0.3944 + 0.4938 = 0.8882.
\end{aligned}$$

Hence the expected number of samples = (300)(0.8882) = 266.46, or about 266.
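The table look-up in part (b) can be cross-checked with the normal distribution in Python's standard library. This is a sketch, not part of the original solution: `NormalDist` evaluates the normal c.d.f. directly, so the probability may differ from the four-figure table value in the last decimal place.

```python
import math
from statistics import NormalDist

mu, sigma, n = 22.40, 0.048, 36
se = sigma / math.sqrt(n)                # standard error of the mean (fpc dropped)

xbar = NormalDist(mu, se)                # sampling distribution of the mean
p = xbar.cdf(22.42) - xbar.cdf(22.39)    # P(22.39 <= X-bar <= 22.42)

print(round(se, 3))                      # standard error in ounces
print(round(p, 4))                       # probability, close to the table value 0.8882
print(300 * p)                           # expected number of the 300 sample means
```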


Example 14.11. A construction company has 310 employees who
have an average annual salary of Rs. 24,000. The standard deviation of
annual salaries is Rs. 5,000. In a random sample of 100 employees, what
is the probability that the average salary will exceed Rs. 24,500?

The sample size (n=100) is large enough to assume that the sampling
distribution of $\bar{X}$ is approximately normal with mean

$$\mu_{\bar{X}} = \mu = \text{Rs. } 24{,}000$$

and standard deviation

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = \frac{5000}{\sqrt{100}}\sqrt{\frac{310-100}{310-1}} = \text{Rs. } 412.20,$$

where we have used fpc, because the sample size n=100 is greater than
5 per cent of the population size N=310.

Equivalently, $Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - 24000}{412.20}$ is approximately N(0, 1).

We are required to evaluate $P(\bar{X} > 24{,}500)$.

At $\bar{x}$ = 24,500, we find that $z = \frac{24500 - 24000}{412.20} = 1.21$.

Hence using the Table of areas under the normal curve, we get

$$\begin{aligned}
P(\bar{X} > 24{,}500) &= P(Z > 1.21) \\
&= 0.5 - P(0 < Z < 1.21) \\
&= 0.5 - 0.3869 = 0.1131.
\end{aligned}$$


Example 14.12. Calculate the standard error of the mean from the
following data, which give the earnings of a particular class of persons.

Earning (Rs.):    1-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | ...
No. of persons:   ...

Since the population standard deviation $\sigma$ is not known and the
sample size (n=1000) is large enough to replace it with the sample
standard deviation s, we therefore first calculate the sample standard
deviation as below:

(Computation table; column totals 612 and 3208.)

The sample mean and the sample standard deviation are:

$$\bar{x} = a + \frac{\sum fu}{n} \times h = \cdots + \frac{612}{1000} \times 10 = \text{Rs. } 41.62, \text{ and } \ldots$$

$$s_{\bar{X}} = \frac{s}{\sqrt{n}} \qquad [\text{Replacing } \sigma \text{ by } s]$$


14.4.2. Central Limit Theorem. The theorem states: "If a random
variable X from a population has mean $\mu$ and finite variance $\sigma^2$, then
the sampling distribution of the sample
mean $\bar{X}$ approaches a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$
as the sample size n approaches infinity."

To prove this theorem, we use the moment generating function
about the mean. By definition, we have

$$\begin{aligned}
M_{X-\mu}(t) &= E\left[e^{(X-\mu)t}\right] \\
&= 1 + \mu_1 t + \mu_2\frac{t^2}{2!} + \mu_3\frac{t^3}{3!} + \cdots \\
&= 1 + 0 + \frac{\sigma^2 t^2}{2!} + \mu_3\frac{t^3}{3!} + \cdots
\end{aligned}$$

Let us define a random variable Y as $Y = \frac{X - \mu}{\sigma\sqrt{n}}$.

Then the m.g.f. of Y is

$$\begin{aligned}
M_Y(t) &= E\left[e^{(X-\mu)t/(\sigma\sqrt{n})}\right] = M_{X-\mu}\left(\frac{t}{\sigma\sqrt{n}}\right) \\
&= 1 + \frac{t^2}{2n} + \frac{\mu_3 t^3}{3!\,\sigma^3 n\sqrt{n}} + \frac{\mu_4 t^4}{4!\,\sigma^4 n^2} + \cdots
\end{aligned}$$

Let us define another variable Z as a linear function of $\bar{X}$ as

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \sum_{i=1}^{n}\frac{X_i - \mu}{\sigma\sqrt{n}} = \sum_{i=1}^{n} Y_i.$$

We know that the m.g.f. of a sum of independent and identically
distributed random variables is the nth power of their own m.g.f. Thus

$$M_Z(t) = \left[M_Y(t)\right]^n = \left[1 + \frac{t^2}{2n} + \frac{\mu_3 t^3}{3!\,\sigma^3 n\sqrt{n}} + \cdots\right]^n \to e^{t^2/2} \text{ as } n \to \infty.$$

This is the m.g.f. of a normal random variable with zero mean
and unit variance. Thus Z has in the limit a standard normal
distribution.

Now $\bar{X} = \mu + \frac{\sigma}{\sqrt{n}}Z$.

Since a linear function (here $\bar{X}$) of a normal random variable (Z) is a
normal random variable, therefore $\bar{X}$ is in the limit normally distributed
with mean $\mu$ and variance $\sigma^2/n$.

It is interesting to note that we have neither assumed that the
distribution of X is continuous, nor have we said anything about the
shape of the distribution of X, whereas the limiting distribution of $\bar{X}$ is
continuous and normal. Thus the distribution of the sample mean,
regardless of the shape of the population distribution but having a finite
variance, is approximately normal with mean $\mu$ and variance $\sigma^2/n$.
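The theorem can be illustrated by simulation. The sketch below rests on assumptions that are not in the text: the population is taken to be exponential with mean 1 and standard deviation 1 (a markedly skewed distribution), the sample size is 36, and 5000 samples are drawn with a fixed seed. If the sample means are approximately normal, their mean should be close to $\mu$, their standard deviation close to $\sigma/\sqrt{n}$, and about 95% of the standardized means should lie within $\pm 1.96$.

```python
import math
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable

n, reps = 36, 5000                  # sample size and number of simulated samples
mu, sigma = 1.0, 1.0                # mean and s.d. of the exponential(1) population

means = [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
         for _ in range(reps)]

se = sigma / math.sqrt(n)           # theoretical standard error, 1/6
z = [(m - mu) / se for m in means]  # standardized sample means
inside = sum(abs(v) <= 1.96 for v in z) / reps

print(round(statistics.fmean(means), 3))    # close to mu
print(round(statistics.pstdev(means), 3))   # close to se
print(round(inside, 3))                     # close to 0.95
```

A simulation only suggests the result; the proof above is what establishes it.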


Example 14.13. Given the population 1, 1, 1, 3, 4, 5, 6, 6, 6 and 7.

(a) Find the probability that a random sample of size 36 selected
with replacement will yield a sample mean between 3.26 and
4.74.

(b) Find the mean and standard deviation for the sampling
distribution of means for a sample of size 4 selected at random
without replacement. Between what two values would you expect
at least $\frac{3}{4}$ of the sample means to fall? (P.U., B.A./B.Sc. 1986)

To calculate the mean and standard deviation of the population, we
may describe the population by the following probability distribution:

x       1     3     4     5     6     7
f(x)    0.3   0.1   0.1   0.1   0.3   0.1

The mean and standard deviation of the population are:

$$\mu = \sum x\,f(x) = 4, \quad\text{and}\quad \sigma = \sqrt{\sum x^2 f(x) - \mu^2} = \sqrt{21 - 16} = \sqrt{5} = 2.236.$$

(a) As the sampling is performed with replacement, therefore a
sample of any size can be selected. A sample of size n=36 is large enough
for the central limit theorem to apply. The sampling distribution of $\bar{X}$ is
therefore approximately normal with mean $\mu_{\bar{X}} = \mu = 4$ and
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2.236}{\sqrt{36}} = 0.373$, that is

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - 4}{0.373} \text{ is approximately } N(0, 1).$$

To find the probability that the mean of a random sample of size
n=36 will fall between 3.26 and 4.74, we transform 3.26 and 4.74 to z
values. Thus at $\bar{x}$ = 3.26, we find

$$z = \frac{3.26 - 4}{0.373} = -1.98,$$

and at $\bar{x}$ = 4.74, we find

$$z = \frac{4.74 - 4}{0.373} = 1.98.$$

Hence using the Table of areas under the normal curve, we find

$$\begin{aligned}
P(3.26 \le \bar{X} \le 4.74) &= P(-1.98 \le Z \le 1.98) \\
&= P(-1.98 \le Z \le 0) + P(0 \le Z \le 1.98) \\
&= 0.4762 + 0.4762 = 0.9524.
\end{aligned}$$

(b) As the sampling is without replacement and the sample size n=4 is
greater than 5% of the population size N=10, therefore the mean
and standard deviation of the sampling distribution of $\bar{X}$ are

$$\mu_{\bar{X}} = \mu = 4, \text{ and}$$

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = \frac{2.236}{\sqrt{4}}\sqrt{\frac{10-4}{10-1}} = (1.118)(0.816) = 0.912.$$

Chebyshev's inequality says "at least $\left(1 - \frac{1}{k^2}\right)$ fraction of the
data lies in the interval mean $\pm$ k(s.d.)" and the problem says "at least $\frac{3}{4}$
of the sample means should fall in the same interval," so

$$\frac{3}{4} = 1 - \frac{1}{k^2}, \text{ that is } k = 2.$$

Hence we would expect at least $\frac{3}{4}$ of the sample means to fall in the
interval $\mu_{\bar{X}} \pm 2\sigma_{\bar{X}}$, that is between 4 - 2(0.912) and 4 + 2(0.912), or
between 2.2 and 5.8.
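Part (b) of Example 14.13 can be reproduced in a few lines. The sketch below is illustrative only: the helper `chebyshev_interval` is a hypothetical name, and it solves $1 - 1/k^2 = $ fraction for k before forming mean $\pm$ k(s.d.). Because it carries full precision, the standard error comes out as 0.913 rather than the 0.912 obtained from rounded factors in the text.

```python
import math


def chebyshev_interval(mean, sd, fraction):
    """Interval mean +/- k*sd that holds at least `fraction` of the values."""
    k = 1 / math.sqrt(1 - fraction)     # from fraction = 1 - 1/k**2
    return mean - k * sd, mean + k * sd


population = [1, 1, 1, 3, 4, 5, 6, 6, 6, 7]
N, n = len(population), 4

mu = sum(population) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)
se = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))   # fpc applied: n > 5% of N

low, high = chebyshev_interval(mu, se, 0.75)
print(mu, round(sigma, 3), round(se, 3))
print(round(low, 1), round(high, 1))
```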
Example 14.14. A random sample of size 25 is selected from a
Poisson distribution with $\mu$=3. Find, using the central limit theorem, the
probability that the sample mean will be greater than 4.

Let X denote the Poisson random variable with $\mu$=3. Then Var(X) = 3.
By the central limit theorem, $\bar{X}$ is approximately $N\left(3, \frac{3}{25}\right)$.

We require $P(\bar{X} > 4)$. Thus

$$P(\bar{X} > 4) = P\left(\frac{\bar{X} - 3}{\sqrt{3/25}} > \frac{4 - 3}{\sqrt{3/25}}\right) = P(Z > 2.89) = 0.0019.$$
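The normal approximation in Example 14.14 can be compared with an exact calculation. The sketch below relies on a standard fact that the example does not use: the total of 25 independent Poisson(3) observations is itself Poisson with mean 75, so the sample mean exceeds 4 exactly when the total is 101 or more. The variable names are illustrative.

```python
import math
from statistics import NormalDist

lam, n = 3, 25

# Central limit theorem: X-bar is approximately N(3, 3/25).
approx = 1 - NormalDist(lam, math.sqrt(lam / n)).cdf(4)

# Exact: the sample total is Poisson(75); X-bar > 4 means total >= 101.
total_mean = lam * n
exact = 1 - sum(math.exp(-total_mean) * total_mean ** k / math.factorial(k)
                for k in range(101))

print(round(approx, 4))   # normal approximation, as in the worked example
print(exact)              # exact tail probability of the Poisson total
```

The two values are of the same order but not identical, which reflects the skewness of the Poisson distribution at this sample size.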


14.4.3. Sampling Distribution of Differences between Means. 
Suppose we have two large or infinite 
and variances of and o? respectivel 
of sizes ny and ny 
diXerences t- 


populations with means H and Ho, 
y. Let independent random samples 
be selected from the respective populations, and the 
between the means of all possible pairs of samples be 
robability distribution of the differences Xi f Xo can 
distribution is called the sampling distribution of the 
wine X, = X,. The sampling distribution of the 


5 roperties;: 


2 has the following p 
The mean of the sampling d 


u : istribution of X,-Xo, denoted by 
-y >» Is a 
that'g ual to the difference between population means, 


pe RES: 


(ii) The standard ER -2 = E(X,) — E(X,) = yb) 
ard deviati he 
(standard eror of $- ne Sampling distribution of XX 


2), denoted by OF ey is giver. by 


STATISTICAL T ; 
INTRODUCTION TO | HEO 
ICAL THEORY 


we would expect at least Z of the sample means to fall in the. 


the same population, the expression for the S. 


Ox, -X, ny No ` [. varc,~ 2) = Var(ž,)+Var(ž,) = 


o? oF 
ny F ree (Samples are independent) 


This expression for the S.E, of X,-X, also holds for finite 
populations when sampling is performed with replacement. When 
population standard deviations are equal or both the samples come from, 


E. becomes 


If the values of O} and ©, are not known and if both sample sizes 
are large, they are replaced by 5, and So, the standard deviations of the 
respective samples. The S.E. becomes : 


Ist s2 
=] 2 
S ngA pA 
X-X, ny Ng 


If, on the other hand, the populations are finite, sampling is done 


` without replacement and the sample sizes are larger than 5 per cent of 


the population sizes, then the S.E. is 


o3 N-n o3 N-n 

ad Se, 2 E L 

ni N,;-1 M Ng-l 

(iii) Shape of the distribution. If the populations are normally distributed, the sampling distribution of $\bar{X}_1 - \bar{X}_2$, regardless of sample sizes, will be normal with mean $\mu_1 - \mu_2$ and variance $\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$. In other words, the variable

$Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

is normally distributed with zero mean and unit variance. If the populations are non-normal and if both sample sizes are large,

then the sampling distribution of differences between means is approximately normal by the central limit theorem.

Example 14.15. Draw all possible random samples of size $n_1 = 2$ with replacement from a finite population consisting of 4, 6, 8. Similarly, draw all possible random samples of size $n_2 = 2$ with replacement from another finite population consisting of 1, 2, 3.

(a) Find the possible differences between the sample means of the two populations.

(b) Construct the sampling distribution of $\bar{X}_1 - \bar{X}_2$ and compute its mean and variance.

(c) Verify that $\mu_{\bar{X}_1-\bar{X}_2} = \mu_1 - \mu_2$ and $\sigma^2_{\bar{X}_1-\bar{X}_2} = \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$.
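There are $(3)^2 = 9$ possible samples which can be drawn with replacement from each population. These two sets of samples and their means are given below:

Sample No. | From population 1        | From population 2
           | Sample values | Mean     | Sample values | Mean
    1      |     4, 4      |   4      |     1, 1      |  1.0
    2      |     4, 6      |   5      |     1, 2      |  1.5
    3      |     4, 8      |   6      |     1, 3      |  2.0
    4      |     6, 4      |   5      |     2, 1      |  1.5
    5      |     6, 6      |   6      |     2, 2      |  2.0
    6      |     6, 8      |   7      |     2, 3      |  2.5
    7      |     8, 4      |   6      |     3, 1      |  2.0
    8      |     8, 6      |   7      |     3, 2      |  2.5
    9      |     8, 8      |   8      |     3, 3      |  3.0

(a) Pairing each of the 9 sample means from population 1 with each of the 9 sample means from population 2 gives 9 x 9 = 81 possible differences $d = \bar{x}_1 - \bar{x}_2$, ranging from 4 - 3 = 1 to 8 - 1 = 7.

(b) The sampling distribution of $\bar{X}_1 - \bar{X}_2$ (i.e. the relative frequency distribution of the possible differences $d = \bar{x}_1 - \bar{x}_2$) is constructed below and the mean and variance of this distribution are also computed below.

  d   | Probability f(d) | d f(d)  | d^2 f(d)
 1.0  |      1/81        |   1/81  |   1.0/81
 1.5  |      2/81        |   3/81  |   4.5/81
 2.0  |      5/81        |  10/81  |  20.0/81
 2.5  |      6/81        |  15/81  |  37.5/81
 3.0  |     10/81        |  30/81  |  90.0/81
 3.5  |     10/81        |  35/81  | 122.5/81
 4.0  |     13/81        |  52/81  | 208.0/81
 4.5  |     10/81        |  45/81  | 202.5/81
 5.0  |     10/81        |  50/81  | 250.0/81
 5.5  |      6/81        |  33/81  | 181.5/81
 6.0  |      5/81        |  30/81  | 180.0/81
 6.5  |      2/81        |  13/81  |  84.5/81
 7.0  |      1/81        |   7/81  |  49.0/81
Total |        1         | 324/81  | 1431/81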


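Thus the mean and the variance are

$\mu_{\bar{X}_1-\bar{X}_2} = \sum d\,f(d) = \dfrac{324}{81} = 4$, and

$\sigma^2_{\bar{X}_1-\bar{X}_2} = \sum (d - \mu_{\bar{X}_1-\bar{X}_2})^2 f(d) = \sum d^2 f(d) - \left[\sum d\,f(d)\right]^2 = \dfrac{1431}{81} - (4)^2 = \dfrac{53}{3} - 16 = \dfrac{5}{3}$.

(c) The mean and variance of the first population are

$\mu_1 = \dfrac{4+6+8}{3} = 6$, and $\sigma_1^2 = \dfrac{(4-6)^2 + (6-6)^2 + (8-6)^2}{3} = \dfrac{8}{3}$.

The mean and variance of the second population are

$\mu_2 = \dfrac{1+2+3}{3} = 2$, and $\sigma_2^2 = \dfrac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \dfrac{2}{3}$.

Now $\mu_{\bar{X}_1-\bar{X}_2} = 4 = 6 - 2 = \mu_1 - \mu_2$, and

$\sigma^2_{\bar{X}_1-\bar{X}_2} = \dfrac{5}{3} = \dfrac{8/3}{2} + \dfrac{2/3}{2} = \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$.

Hence the result.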
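The verification in Example 14.15 can also be reproduced by brute-force enumeration. The following is an illustrative Python sketch, not part of the original solution; the helper names `sample_means` and `mean_and_variance` are assumptions introduced here. Exact fractions are used so the results can be compared directly with 4 and 5/3.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sample_means(population, n):
    """Means of all ordered samples of size n drawn with replacement."""
    return [Fraction(sum(s), n) for s in product(population, repeat=n)]


def mean_and_variance(values):
    """Mean and divisor-N variance of a list of equally likely values."""
    m = sum(values, Fraction(0)) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


pop1, pop2, n1, n2 = [4, 6, 8], [1, 2, 3], 2, 2

# Pair every sample mean from population 1 with every one from population 2.
diffs = [a - b for a in sample_means(pop1, n1) for b in sample_means(pop2, n2)]
freq = Counter(diffs)  # 81 * f(d) for each difference d

mu_d, var_d = mean_and_variance(diffs)
mu1, var1 = mean_and_variance([Fraction(x) for x in pop1])
mu2, var2 = mean_and_variance([Fraction(x) for x in pop2])

print(mu_d, var_d)                        # 4 5/3
print(mu1 - mu2, var1 / n1 + var2 / n2)   # 4 5/3
```

The counter `freq` holds the numerators of the probability column, for example 13 for d = 4.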


Example 14.16. Car batteries produced by company A have a mean life of 4.3 years with a standard deviation of 0.6 years. A similar battery produced by company B has a mean life of 4.0 years and a standard deviation of 0.4 years. What is the probability that a random sample of 49 batteries from company A will have a mean life of at least 0.5 years more than the mean life of a sample of 36 batteries from company B?

We are given the following data:

Population A: $\mu_1 = 4.3$ years, $\sigma_1 = 0.6$ years, $n_1 = 49$

Population B: $\mu_2 = 4.0$ years, $\sigma_2 = 0.4$ years, $n_2 = 36$

Both sample sizes ($n_1 = 49$, $n_2 = 36$) are large enough to assume that the sampling distribution of the differences $\bar{X}_1 - \bar{X}_2$ is approximately normal with mean

$\mu_{\bar{X}_1-\bar{X}_2} = \mu_1 - \mu_2 = 4.3 - 4.0 = 0.3$ years and standard deviation

$\sigma_{\bar{X}_1-\bar{X}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} = \sqrt{\dfrac{(0.6)^2}{49} + \dfrac{(0.4)^2}{36}} = 0.1086$ years.

Thus the variable

$Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \dfrac{(\bar{X}_1 - \bar{X}_2) - 0.3}{0.1086}$ is approximately N(0, 1).

We are required to find the probability that the mean life of 49 batteries produced by company A will be at least 0.5 years longer than the mean life of 36 batteries produced by company B, that is, we want $P(\bar{X}_1 - \bar{X}_2 \ge 0.5)$.

Transforming $\bar{X}_1 - \bar{X}_2 = 0.5$ to z value, we find that

$z = \dfrac{0.5 - 0.3}{0.1086} = 1.84$.

Hence using Table of areas under normal curve, we find

$P(\bar{X}_1 - \bar{X}_2 \ge 0.5) = P(Z \ge 1.84)$
$= 0.5 - P(0 \le Z \le 1.84)$
$= 0.5 - 0.4671$
$= 0.0329$

14.4.4. Sampling Distribution of Sample Proportions. Population proportion p may be identified with the population mean $\mu$, where the mean is obtained from the units whose possible values are either 0's or 1's. In other words, let

$Y_i = 1$, if the ith unit possesses the characteristic of interest,
$Y_i = 0$, if the ith unit does not possess the characteristic of interest.

Then the mean is

$\mu = \dfrac{\sum Y_i}{N} = \dfrac{\text{Number of units having the characteristic of interest}}{\text{Total number of units in the population}} = \dfrac{X}{N}$,

where X represents the number of units having the characteristic of interest.

Thus the mean is simply the proportion of 1's in the population and we write p for $\mu$, meaning proportion (usually called the proportion of success).
Similarly, the sample proportion $\hat{P}$ is defined as

$\hat{P} = \dfrac{X}{n}$,

where X now represents the number of units in a sample of size n having the characteristic of interest.

It is interesting to note that $X = \sum Y_i$ is a binomial random variable and the binomial parameter p is being called a proportion of success here. The sample proportion $\hat{P}$ has different values in different samples. It is obviously a random variable and has a probability distribution. This probability distribution of the proportions of all possible random samples of size n is called the sampling distribution of $\hat{P}$.


The sampling distribution of $\hat{P}$ has the following important properties:

(i) The mean of the sampling distribution of proportions, denoted by $\mu_{\hat{P}}$, is equal to the population proportion p, that is $\mu_{\hat{P}} = p$.

(ii) The standard deviation of the sampling distribution of proportions, called the standard error of $\hat{P}$ and denoted by $\sigma_{\hat{P}}$, is given as

$\sigma_{\hat{P}} = \sqrt{\dfrac{pq}{n}}$,

when the sampling is performed with replacement or the population is infinite, where $q = 1 - p$. It is of importance to note that the same expression may be used when the sample size is less than 5% of the population size N.

When the population proportion p is not known and the sample size is large, then the sample proportion $\hat{p}$ calculated from sample data is used in place of p in the expression for the S.E. of $\hat{P}$, getting

$s_{\hat{P}} = \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$, where $\hat{q} = 1 - \hat{p}$.

When the sample is selected without replacement from a finite population of size N, the S.E. becomes

$\sigma_{\hat{P}} = \sqrt{\dfrac{pq}{n}}\cdot\sqrt{\dfrac{N-n}{N-1}}$.
(iii) Shape of the distribution. The sampling distribution of $\hat{P}$ is the binomial distribution. However, for sufficiently large sample sizes, the sampling distribution of $\hat{P}$ is approximately normal. As a rule of thumb, the sampling distribution of $\hat{P}$ will be approximately normal whenever both np and nq are equal to or greater than 5.

It helps to remember that we use a continuity correction of $\pm\dfrac{1}{2}$ whenever we consider the normal approximation to the binomial distribution. Now, we need to use a continuity correction of $\pm\dfrac{1}{2n}$, as $\hat{P} = \dfrac{X}{n}$.
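As a small illustration of properties (ii) and (iii), the sketch below evaluates the standard error of a proportion with and without the finite population correction, and applies the np, nq rule of thumb. The function names `se_proportion` and `normal_approx_ok` are assumptions made for this sketch, not notation from the text.

```python
import math


def se_proportion(p, n, N=None):
    """Standard error of a sample proportion.

    With N omitted: sqrt(pq/n) (sampling with replacement or a very
    large population). With N given: the same value multiplied by the
    finite population correction sqrt((N - n) / (N - 1)).
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se


def normal_approx_ok(p, n, threshold=5):
    """Rule of thumb from the text: both np and nq at least 5."""
    return n * p >= threshold and n * (1 - p) >= threshold


print(se_proportion(0.10, 144))       # about 0.025
print(se_proportion(0.5, 3, N=6))     # about 0.2236, i.e. variance 0.05
print(normal_approx_ok(0.10, 144))    # True  (np = 14.4, nq = 129.6)
print(normal_approx_ok(0.5, 6))       # False (np = nq = 3)
```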
Example 14.17. A population consists of N=6 values 1, 3, 6, 8, 9 and 12. Draw all possible samples of size n=3 without replacement from the population and find the proportion of even numbers in the samples. Construct the sampling distribution of sample proportions and verify that

(i) $\mu_{\hat{P}} = p$, (ii) $\mathrm{Var}(\hat{P}) = \dfrac{pq}{n}\cdot\dfrac{N-n}{N-1}$,

where $q = 1 - p$; $\hat{P}$ and p are sample and population proportions respectively.

The number of possible samples of size n=3 that could be selected without replacement is $\binom{6}{3} = 20$. Let $\hat{P}$ represent the proportion of even numbers in the sample. Then the 20 possible samples and the proportion of even numbers are given as follows:


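Sample No. | Sample   | Proportion | Sample No. | Sample   | Proportion
    1      | 1, 3, 6  |    1/3     |    11      | 3, 6, 8  |    2/3
    2      | 1, 3, 8  |    1/3     |    12      | 3, 6, 9  |    1/3
    3      | 1, 3, 9  |     0      |    13      | 3, 6, 12 |    2/3
    4      | 1, 3, 12 |    1/3     |    14      | 3, 8, 9  |    1/3
    5      | 1, 6, 8  |    2/3     |    15      | 3, 8, 12 |    2/3
    6      | 1, 6, 9  |    1/3     |    16      | 3, 9, 12 |    1/3
    7      | 1, 6, 12 |    2/3     |    17      | 6, 8, 9  |    2/3
    8      | 1, 8, 9  |    1/3     |    18      | 6, 8, 12 |     1
    9      | 1, 8, 12 |    2/3     |    19      | 6, 9, 12 |    2/3
   10      | 1, 9, 12 |    1/3     |    20      | 8, 9, 12 |    2/3

The sampling distribution of the sample proportion is therefore:

Proportion p^ | No. of samples | Probability f(p^) | p^ f(p^) | p^2 f(p^)
      0       |       1        |       1/20        |    0     |     0
     1/3      |       9        |       9/20        |   3/20   |    1/20
     2/3      |       9        |       9/20        |   6/20   |    4/20
      1       |       1        |       1/20        |   1/20   |    1/20
    Total     |      20        |        1          |  10/20   |    6/20

Thus $\mu_{\hat{P}} = \sum \hat{p}\,f(\hat{p}) = \dfrac{10}{20} = 0.5$, and

$\mathrm{Var}(\hat{P}) = \sum \hat{p}^2 f(\hat{p}) - \left[\sum \hat{p}\,f(\hat{p})\right]^2 = \dfrac{6}{20} - \left(\dfrac{10}{20}\right)^2 = \dfrac{1}{20} = 0.05$.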


To verify the given relations, we first calculate the population proportion p and the population variance pq. Thus

$p = \dfrac{X}{N}$, where X represents the number of even numbers,

$= \dfrac{3}{6} = 0.5$, and

$\sigma^2 = pq = (0.5)(0.5) = 0.25$.

Therefore $\mu_{\hat{P}} = 0.5 = p$, and

$\dfrac{pq}{n}\cdot\dfrac{N-n}{N-1} = \dfrac{0.25}{3}\cdot\dfrac{6-3}{6-1} = \dfrac{0.25}{5} = 0.05 = \mathrm{Var}(\hat{P})$.

Hence the result.
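The enumeration in Example 14.17 is small enough to check mechanically. The sketch below is an illustrative Python check (variable names are assumptions of this sketch); it lists all 20 samples drawn without replacement, tabulates the proportion of even numbers and compares the variance with the finite population formula using exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [1, 3, 6, 8, 9, 12]
n, N = 3, len(population)

# Proportion of even numbers in each of the C(6, 3) = 20 possible samples.
props = [Fraction(sum(x % 2 == 0 for x in s), n)
         for s in combinations(population, n)]
dist = Counter(props)  # number of samples giving each proportion

mean_p = sum(props) / len(props)
var_p = sum((ph - mean_p) ** 2 for ph in props) / len(props)

# Population proportion and the right-hand side of relation (ii).
p = Fraction(sum(x % 2 == 0 for x in population), N)
q = 1 - p
rhs = p * q / n * Fraction(N - n, N - 1)

print(sorted(dist.items()))   # proportions 0, 1/3, 2/3, 1 occur 1, 9, 9, 1 times
print(mean_p, p)              # 1/2 1/2
print(var_p, rhs)             # 1/20 1/20
```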

Example 14.18. Ten percent of the 1-kilogram boxes of sugar in a large warehouse are underweight. Suppose a retailer buys a random sample of 144 of these boxes. What is the probability that at least 5 per cent of the sample boxes will be underweight?

Here the statistic is the sample proportion P. 

The sample size (n= 144) is large enough to assume that the sample 
proportion P is approximately normally distributed with mean 


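$\mu_{\hat{P}} = p = 0.10$, and standard error

$\sigma_{\hat{P}} = \sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{(0.10)(0.90)}{144}} = \dfrac{0.3}{12} = 0.025$.

Therefore $Z = \dfrac{\hat{P} - \mu_{\hat{P}}}{\sigma_{\hat{P}}} = \dfrac{\hat{P} - p}{\sqrt{pq/n}} = \dfrac{\hat{P} - 0.10}{0.025}$ is approximately N(0, 1).

We are asked to find the probability that the sample proportion of the underweight boxes is equal to or greater than 5%, i.e., we require $P(\hat{P} \ge 0.05)$.

$P(\hat{P} \ge 0.05) = P\left(\hat{P} \ge 0.05 - \dfrac{1}{288}\right)$ [Continuity correction, $\dfrac{1}{2n} = \dfrac{1}{288}$]

$= P\left(\dfrac{\hat{P} - 0.10}{0.025} \ge \dfrac{(0.05 - 1/288) - 0.10}{0.025}\right)$

$= P(Z \ge -2.14)$
$= P(-2.14 \le Z \le 0) + P(0 \le Z < \infty)$
$= 0.4838 + 0.5 = 0.9838$ (From area table)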
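The arithmetic of Example 14.18 can be retraced without a table by computing the standard normal distribution function from the error function. This is a hedged sketch following the same continuity correction as the worked solution; the helper name `phi` is an assumption of this sketch.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n, p = 144, 0.10
se = math.sqrt(p * (1 - p) / n)    # standard error of the proportion
cutoff = 0.05 - 1 / (2 * n)        # continuity correction of 1/(2n) = 1/288
z = (cutoff - p) / se
prob = 1 - phi(z)                  # P(Z >= z)

print(round(se, 3), round(z, 2), round(prob, 4))
```

The unrounded z is about -2.139, so the result agrees with the tabled value 0.9838 to four decimal places.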


14.4.5. Sampling Distribution of Differences between Proportions. Suppose there are two binomial populations with proportions of successes $p_1$ and $p_2$ respectively. Let independent random samples of sizes $n_1$ and $n_2$ be drawn from the respective populations, and the differences $\hat{P}_1 - \hat{P}_2$ between the proportions of all possible pairs of samples be computed. Then a probability distribution of the differences $\hat{P}_1 - \hat{P}_2$ can be obtained. Such a probability distribution is called the sampling distribution of the differences between the proportions $\hat{P}_1 - \hat{P}_2$, which has the following important properties:

(i) The mean of the sampling distribution of $\hat{P}_1 - \hat{P}_2$, denoted by $\mu_{\hat{P}_1-\hat{P}_2}$, is equal to the difference between the population proportions, that is $\mu_{\hat{P}_1-\hat{P}_2} = p_1 - p_2$.

(ii) The standard deviation of the sampling distribution of $\hat{P}_1 - \hat{P}_2$ (i.e. the standard error of $\hat{P}_1 - \hat{P}_2$), denoted by $\sigma_{\hat{P}_1-\hat{P}_2}$, is given by

$\sigma_{\hat{P}_1-\hat{P}_2} = \sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$, where $q = 1 - p$.

If both populations have the same proportion of successes, i.e. $p_1 = p_2 = p$, or if both the samples have been drawn from a common binomial population, then

$\sigma_{\hat{P}_1-\hat{P}_2} = \sqrt{pq\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$.

Whenever the value of the common proportion p is not known, then for large sample sizes it is replaced with its estimate $\hat{p}_c$, which is computed by taking a weighted mean of the two observed sample proportions $\hat{p}_1$ and $\hat{p}_2$ as follows:

$\hat{p}_c = \dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \dfrac{\text{Sum of successes in the two samples}}{\text{Total sample size}}$.

The standard error of $\hat{P}_1 - \hat{P}_2$ then becomes

$S_{\hat{P}_1-\hat{P}_2} = \sqrt{\hat{p}_c\,\hat{q}_c\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$, where $\hat{q}_c = 1 - \hat{p}_c$.

If, on the other hand, $p_1 \ne p_2$ and they are also not known, then for large sample sizes they are replaced with the sample proportions $\hat{p}_1$ and $\hat{p}_2$ respectively. The S.E. of $\hat{P}_1 - \hat{P}_2$ then becomes

$S_{\hat{P}_1-\hat{P}_2} = \sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$.

(iii) Shape of the distribution. The sampling distribution of $\hat{P}_1 - \hat{P}_2$ is approximately normal for sufficiently large sample sizes.
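The three standard error expressions in property (ii) differ only in which proportions are plugged in. The sketch below puts them side by side; the function names and the success counts (30 of 50 and 45 of 100) are hypothetical values chosen for illustration, not data from the text.

```python
import math


def se_diff_props(p1, n1, p2, n2):
    """S.E. of P1 - P2 with separate proportions (known or sample values)."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def pooled_proportion(x1, n1, x2, n2):
    """Weighted mean of two sample proportions: total successes / total size."""
    return (x1 + x2) / (n1 + n2)


def se_diff_pooled(pc, n1, n2):
    """S.E. of P1 - P2 under a common proportion pc."""
    return math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))


# Hypothetical samples: 30 successes out of 50, and 45 out of 100.
x1, n1, x2, n2 = 30, 50, 45, 100
pc = pooled_proportion(x1, n1, x2, n2)           # 75 / 150 = 0.5
print(pc)
print(se_diff_pooled(pc, n1, n2))                # common-proportion form
print(se_diff_props(x1 / n1, n1, x2 / n2, n2))   # separate-proportion form
```

When the two proportions are equal, the separate and pooled forms coincide, which is a quick consistency check on either function.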


Example 14.19. Two random samples of sizes $n_1 = 40$ and $n_2 = 45$ are drawn from a binomial population with p=0.60. What is the probability that $-0.15 < \hat{P}_1 - \hat{P}_2 < +0.15$?

Both sample sizes ($n_1 = 40$ and $n_2 = 45$) are large enough to assume that the sampling distribution of $\hat{P}_1 - \hat{P}_2$ is approximately normal with mean

$\mu_{\hat{P}_1-\hat{P}_2} = p_1 - p_2 = 0$, and standard deviation

$\sigma_{\hat{P}_1-\hat{P}_2} = \sqrt{pq\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} = \sqrt{(0.60)(0.40)\left(\dfrac{1}{40} + \dfrac{1}{45}\right)} = \sqrt{(0.24)(0.0472)} = 0.106$.

Thus the variable

$Z = \dfrac{(\hat{P}_1 - \hat{P}_2) - 0}{\sigma_{\hat{P}_1-\hat{P}_2}} = \dfrac{\hat{P}_1 - \hat{P}_2}{0.106}$ is approximately N(0, 1).

Now, at $\hat{p}_1 - \hat{p}_2 = -0.15$, we find that $z = \dfrac{-0.15}{0.106} = -1.42$, and by symmetry $z = +1.42$ at $\hat{p}_1 - \hat{p}_2 = +0.15$.

Hence using Table of areas under normal curve, we find

$P(-0.15 < \hat{P}_1 - \hat{P}_2 < 0.15) = P(-1.42 < Z < 1.42)$
$= P(-1.42 < Z < 0) + P(0 < Z < 1.42)$
$= 0.4222 + 0.4222 = 0.8444$.


The desired probability is therefore 0.8444. 
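For Example 14.19 it is instructive to see how much the rounding of the standard error and of z affects the answer. The sketch below (an illustration with assumed variable names, not the book's method) computes the probability once with the rounded values used in the solution and once without rounding.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n1, n2, p = 40, 45, 0.60
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # about 0.1065

# As in the worked solution: S.E. rounded to 0.106 and z rounded to 1.42.
z_table = round(0.15 / 0.106, 2)
prob_table = phi(z_table) - phi(-z_table)

# Without intermediate rounding.
z_exact = 0.15 / se
prob_exact = phi(z_exact) - phi(-z_exact)

print(round(se, 4), z_table, round(prob_table, 4), round(prob_exact, 4))
```

The rounded route reproduces 0.8444, while the unrounded normal approximation gives roughly 0.841; the difference comes only from rounding z up to 1.42.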


14.4.6. Sampling Distribution of Variances. The sampling distribution of the sample variances calculated from all possible random samples of size n from a normal population with variance $\sigma^2$ is, after suitable scaling, the so-called Chi-Square Distribution, which is discussed in chapter 17. The sampling distribution followed by the ratio of two sample variances is called the F-distribution, to be introduced in chapter 19.


EXERCISES

14.1 (a) Explain the following terms:
Population; Sample; Sampling Frame; Parameter; Statistic.
(b) What is meant by sampling? What is the object of sampling?

14.2 (a) Describe the advantages of sampling over ... (P.U., B.A./B.Sc., 1989)
(b) Explain the term "non-response" in sample surveys.

14.3 Define and distinguish between:
(a) Target and Sampled Populations;
(b) Probability and Nonprobability Sampling;
(c) Sampling With and Without Replacement;
(d) Sampling and Non-sampling Errors;
(e) Random Sampling and Simple Random Sampling.
(P.U., B.A./B.Sc., 1993)

14.4 (a) Explain sampling and non-sampling errors. What methods would you suggest to control each type of error?
(b) What is a biased sample? Explain the different ways in which bias may arise in sample surveys. How can bias be eliminated?
(P.U., B.A./B.Sc. 1989, 93)

14.5 What are the ... categories of errors in data collected by sample surveys? Describe some of the methods for reducing these errors.
(P.U., B.A./B.Sc. 1960)


14.6 Discuss the common types of sampling techniques used to gather information and the advantages and disadvantages of each of these techniques.

14.7 Discuss (i) Simple random ... (iii) Systematic ...
(P.U., B.A./B.Sc. 1983)

14.8 (a) Explain what is meant by a simple random sample. Discuss how a simple random sample could be selected.
(b) Discuss the following:
(i) The "Goldfish bowl" method.
(ii) The random number table.

14.9 (a) What is a stratified random sample? In what way does it differ from a simple random sample, and what are the advantages and disadvantages of using this sampling technique?
(b) What is meant by allocation of sample size? Explain how a sample is allocated in stratified sampling.

14.10 Describe, in detail, various probability and non-probability sampling techniques. (P.U., B.A./B.Sc. 1981)

14.11 (a) Distinguish between
(i) Probability and Non-probability Sampling
(ii) Sampling and Non-sampling Errors
(iii) Multistage and Multiphase Sampling.
(P.U., B.A./B.Sc. 1974, 92)
(b) What is a cluster sample? Why are cluster samples used?

14.12 (a) Describe the necessity of sampling and sample surveys.
(b) What is a Systematic sample? Describe the procedure of drawing a systematic sample of n units from a population of N units.

14.13 Describe the following types of Samples:
(a) Random Sample (b) Quota Sample
(c) Systematic Random Sample (d) Multi-stage Random Sample
State the conditions under which each would be used and the advantages to be gained. (B.U., M.A. Stats. 1964)

14.14 Explain each of the following:
(i) Area Sampling (ii) Optimum Allocation
(iii) Sequential Sampling (iv) Purposive Sampling
(v) Quota Sampling.

14.15 Draw all possible distinct samples of size two from the following population: 2, 4, 6, 8, 10.
Calculate the means and the variances of the samples and of the population. Discuss the results. (P.U., B.A./B.Sc.)

14.17 Explain how you would select a random sample of 10 households from a list of 250 households, by using a table of random numbers.
(P.U., M.A. Stats. 1967)


14.18 Using a random number table, select 30 samples of size ... with replacement from the following population distribution of heights. Find the mean of sample means.

Height (inches) | No. of students
      ...       |      ...

(P.U., B.Sc. Hons. Part I; 1971)

14.19 Draw, with the help of random numbers, a random sample of size 10 from a
(i) Binomial distribution with parameters p = 0.4 and n = 5;
(ii) Poisson distribution with the parameter ...

14.20 Using a random number table, draw a sample of size 30 from a Normal distribution with $\mu = 100$ and $\sigma^2 = 64$.

14.21 (a) Describe stratified random sampling, explaining in detail the following types of allocation of sample sizes:
(i) Proportional Allocation (ii) Optimum Allocation.
(P.U., B.A./B.Sc., 1991)

... 195 ... 163 ... to select a stratified random sample of size n=40 by proportional allocation, how large a sample must we take from each stratum?
(P.U., B.A./B.Sc. 1980)

(b) A large company has 300,000 employees, the age distribution of whom is shown as follows:

Age (years)   | ...
25 or younger | ...
26 - 35       | ...
36 - 45       | ...
46 - 55       | ...
56 or older   | ...

A sample of 2 per cent of all the employees is desired. Design a sampling plan such that each age-group is proportionally represented. (P.U., B.A./B.Sc., 1988)

14.23 (a) What is a sampling distribution? Describe the properties of the sampling distribution of the means.
(b) What is the finite-correction factor? When is it appropriately used in sampling applications and when can it, without too great an undesirable consequence, be ignored?
(P.U., B.A./B.Sc., 1996)

14.24 (a) Explain the difference between the population distribution, the sample distribution and sampling distribution.
(P.U., B.A./B.Sc., 1989, 93)
(b) What is meant by standard error and what are its practical uses? Derive a formula for the standard error of the mean.
(P.U., B.A. Hons. Part I; 1970)

14.25 (a) Explain the difference between (i) a sample distribution and a sampling distribution (ii) a standard deviation and a standard error.
(b) Suppose a friend says, "I know the formula for computing the standard error of the mean, but I do not understand what the standard error really is." Write a note to your friend explaining what the standard error really is.

14.26 (a) Distinguish between a parameter and a statistic. What is meant by Standard Error and what are its practical uses?
(P.U., B.A./B.Sc.)
(b) Assume that simple random samples of two children are selected with replacement from a population of five children with ages 4, 5, 6, 7 and 8. Let X be the age of any child, find the following:


(i) The theoretical sampling distribution of $\bar{X}$, the mean age of two children in any sample.
(ii) The mean and the standard error of $\bar{X}$.
(B.Z.U., B.A./B.Sc. 1976)

14.27 Given the population 2, 4, 8, 8, 10, 10.
(i) How many samples of size n=2 can be drawn without replacement from this population?
(ii) Compute and tabulate the sampling distribution of the mean for samples of size n=2.

14.28 A finite population consists of the numbers 2, 4 and 6.
(a) Form the sampling distribution of $\bar{X}$, when random samples of size 4 are drawn, with replacement.
(b) Verify that $\mu_{\bar{X}} = \mu$ and $\sigma^2_{\bar{X}} = \dfrac{\sigma^2}{n}$.

14.29 A population consists of 2, 2, 4, 4, 6, 8 and 10.
(i) Calculate the sample means for all possible random samples of size n=2 that can be drawn from this population, without replacement.
(ii) Verify that $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}\cdot\sqrt{\dfrac{N-n}{N-1}}$.
(iii) Between what two values would you expect at least ... of the sample means to fall?
(P.U., B.A./B.Sc. 1993)

14.30 Given the six-element population 0, 3, 6, 12, 15, and 18. How many samples of size n=3 can be drawn, without replacement, from this population? Compute the sampling distribution of the mean for samples of size 3. Compute the mean and standard deviation of this distribution.

14.31 Draw all possible samples of size n=3 with replacement from the population 3, 6, 9 and 12. Form the sampling distribution of sample means. Hence verify the relation between
(i) Mean of the sampling distribution of the mean and the population mean,
(ii) Variance of the sampling distribution of the mean and the population variance.
(P.U., B.A./B.Sc. 1982)

14.32 A population of N=5 has the following values: 4, 5, 7, 9, 10.
(a) Find the population mean and variance.
(b) Suppose samples of size n=3 are selected. Find $\sigma^2_{\bar{X}}$ when sampling is done (i) without replacement, (ii) with replacement.
(c) Select all possible samples of size n=3 without replacement and calculate $\sigma^2_{\bar{X}}$ directly.
(d) Select all possible samples of size n=3 with replacement and calculate $\sigma^2_{\bar{X}}$ directly.

14.33 A population consists of four numbers 2, 4, 6, 8. Draw all possible samples of size n=3 with replacement. Find the mean and the median for each sample. Form the sampling distribution of means and the sampling distribution of medians. Which of these distributions has the smaller variance? How did the means of these two distributions compare with the population mean?

14.34 Given the following population distribution:

...

Find the sampling distribution of the mean if a sample of three numbers is taken without replacement. How do ... of the sampling distribution compare with the population ...?

14.35 A random sample of size n=100 is taken from a population having a mean of 20 and a standard deviation of 5. The shape of the population distribution is unknown.
(a) What can you say about the sampling distribution of the sample mean $\bar{X}$?
(b) Find the probability that $\bar{X}$ will exceed 20.75.
(P.U., B.A./B.Sc. 1996)

14.36 In a local agriculture reporting area, the average wheat yield is known to be 60 bushels per acre with a standard deviation of 1... bushels. If a random sample of 64 acres is selected and the wheat yield recorded, what is the probability that the sample mean will lie between 59 and 61 bushels?

14.37 The heights of 1000 students are approximately normally distributed with a mean of 68.5 inches and a standard deviation

of 2.7 inches. If 200 random samples of size 25 are drawn from this population and the means recorded to the nearest tenth of an inch, determine
(a) The expected mean and standard deviation of the sampling distribution of the mean.
(b) The number of sample means that fall between 67.9 and 6... inches.
(P.U., B.A./B.Sc. 1977-...)

14.38 The heights of a large number of shrubs of the same ... produced for sale by a horticultural nursery are normally distributed with mean 1.14m and standard deviation 0... . ... samples, each consisting of 100 shrubs, are selected. In how many ... would you expect to find the mean of the sample (i) greater than 1.16m; (ii) between 1.13m and ...?

14.39 (a) The following table shows the distribution of 14-year-old schoolboy intelligence test ...

...

(b) The random variable X has the following distribution:

...

(i) Find the mean $\mu_{\bar{X}}$ and variance $\sigma^2_{\bar{X}}$ of the mean $\bar{X}$ for a random sample of 36.
(ii) Find the probability that the mean of 36 items will be less than 5.5.
(P.U., B.A./B.Sc. 1987)

14.40 (a) The mean of a certain Normal distribution is equal to the S.E. of the mean of samples of 100 from that distribution. Find the probability that the mean of a sample of 25 from the distribution will be negative.
(b) ... a mean of 0.1 and a standard ... the probability that the mean of a simple random sample of 900 members will be negative.


Solution. (a) The variable X is $N(\mu, \sigma^2)$ and

$\mu = \dfrac{\sigma}{\sqrt{100}} = \dfrac{\sigma}{10}$, the S.E. of the mean of samples of size n=100,

and the S.E. of the mean of samples of size 25 $= \dfrac{\sigma}{\sqrt{25}} = \dfrac{\sigma}{5}$.

From the standard normal variable $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, we get

$\bar{X} = \mu + Z\dfrac{\sigma}{\sqrt{n}} = \dfrac{\sigma}{10} + Z\dfrac{\sigma}{5}$.

Now $\bar{X}$ will be negative if $\dfrac{\sigma}{10} + Z\left(\dfrac{\sigma}{5}\right) < 0$,

i.e. if $Z < -\dfrac{\sigma}{10}\times\dfrac{5}{\sigma}$ or if $Z < -0.5$.

Hence using the Table of areas under normal curve, we get

$P(Z < -0.5) = 0.5 - P(-0.5 < Z < 0)$
$= 0.5 - 0.1915 = 0.3085$
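The solution to 14.40(a) does not depend on the value of $\sigma$, because it cancels in the standardisation. The short sketch below illustrates this numerically; the function name `prob_negative_mean` and the trial values of sigma are assumptions made for the illustration.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_negative_mean(sigma, n_ref=100, n=25):
    """P(sample mean < 0) when mu equals the S.E. of the mean for n_ref items."""
    mu = sigma / math.sqrt(n_ref)   # population mean = sigma / 10
    se = sigma / math.sqrt(n)       # S.E. of the mean for samples of 25
    return phi((0 - mu) / se)       # z = -0.5 whatever sigma is


for sigma in (0.5, 1.0, 7.0):       # arbitrary trial values
    print(sigma, round(prob_negative_mean(sigma), 4))   # 0.3085 each time
```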


14.41 (a) A random sample of size 100 is taken from a Binomial distribution with parameters p = 0.5 and n = 40. Find, using the central limit theorem, the approximate probability that $\bar{X}$ is (i) greater than 20.5; (ii) less than 19.3; and hence (iii) between 19.3 and 20.5.
(b) A sample of 36 cases is drawn from a negatively skewed population with a mean of 2 and a standard deviation of 3. What is the probability that the sample mean obtained will be negative? How many points must we go from the mean to include 50 percent of all sample means? (P.U., B.A./B.Sc., 1988)

14.42 (a) Describe the properties of the sampling distribution of differences between two means. (P.U., B.A./B.Sc. 1983, ...)
(b) Random samples of size 100 are drawn, with replacement, from two populations and their means $\bar{X}_1$ and $\bar{X}_2$ computed. If $\mu_1 = 10$, $\sigma_1 = 2$, $\mu_2 = 8$ and $\sigma_2 = 1$, find the probability that the difference between a given pair of sample means is (i) less than 1.5, and (ii) greater than 1.75 but less than 2.5.


14.43 Let $\bar{X}_1$ represent the mean of a sample of size $n_1 = 2$ selected with replacement from the finite population 3, 4, 5. Similarly, let $\bar{X}_2$ represent the mean of a sample of size $n_2 = 2$, with replacement, from the population 1, 1, 3.
(a) Find the possible differences between the sample means of the two populations.
(b) Construct the sampling distribution of $\bar{X}_1 - \bar{X}_2$ and compute its mean and variance.
(c) Verify that $\mu_{\bar{X}_1-\bar{X}_2} = \mu_1 - \mu_2$ and $\sigma^2_{\bar{X}_1-\bar{X}_2} = \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$.
(P.U., B.A./B.Sc. 1985)

14.44 The television picture tubes of manufacturer A have a mean life of 6.5 years and a standard deviation of 0.9 years, while ... from manufacturer A will have a ... than ...

14.45 A random sample of size ... is taken from a normal population having a mean of 80 and a standard deviation of 5. A second random sample of size ... is taken from a different normal population having a mean of 75 and a standard deviation of 3. Find the probability that the sample mean computed from the ... measurements ... 36 measurements ...

14.46 What is meant by the term sampling distribution of sample proportions? Describe its properties and explain its usefulness in statistical inference.

14.47 A population consists of N=6 numbers 0, 3, 4, 6, 9 and 15. Draw all possible samples of size n=3, without replacement from the population and find the proportion of even numbers in the samples. Construct the sampling distribution of sample proportion and verify that

$\mu_{\hat{P}} = p$ and $\mathrm{Var}(\hat{P}) = \dfrac{pq}{n}\cdot\dfrac{N-n}{N-1}$.

14.48 A population consists of N=7 numbers 1, 1, 2, 3, 4, 4, 5. Draw all possible samples of size n=3 without replacement from this population and find the sample proportion of odd numbers in the samples. Construct the sampling distribution of sample proportion and verify

(i) $\mu_{\hat{P}} = p$, (ii) $\sigma^2_{\hat{P}} = \dfrac{pq}{n}\cdot\dfrac{N-n}{N-1}$. (P.U., B.A./B.Sc., 1989)

14.49 (a) Two per cent of the trees in a plantation are known to have a certain disease. What is the probability that, in a sample of 250 trees, (i) less than 1%, (ii) more than 4% are diseased?
(b) Suppose that 60% of a city population favours public funding for a proposed recreational facility. If 150 persons are to be randomly selected and interviewed, what is the probability that the sample proportion favouring this issue will be less than 0.52?

14.50 A small, professional society has N=4500 members. The president has mailed n=400 questionnaires to a random sample of members asking whether they wish to affiliate with a larger group. Assuming that the proportion of the entire membership favouring consolidation is p=0.7, find the probability that the sample proportion $\hat{P}$ differs from this by no more than 0.05.

14.51 (a) Describe the sampling distribution of differences between proportions and explain its usefulness in statistical inference.
(b) Two random samples of sizes $n_1 = \ldots$ and $n_2 = 45$ are drawn from a binomial population with p = ... . What is the probability that $-0.1 < \hat{P}_1 - \hat{P}_2 < 0.1$?

14.52 A population consists of five observations 1, 2, 3, 4, 5. Draw all ... samples of size ... Find the mean of the sampling distribution of the ... variance of the population.
(P.U., B.A./B.Sc. 1990)

14.53 Show that the variance of the sample mean $\bar{y}$ of a simple random sample of size n drawn without replacement from a population of size N is given by

$\mathrm{Var}(\bar{y}) = \dfrac{\sigma^2}{n}\cdot\dfrac{N-n}{N-1}$.

(P.U., M.A. Stats, 1966)

14.54 (a) State and prove the central limit theorem.
(P.U., B.A. Hons. Part I, 197...)
(b) Explain why the central limit theorem is so important in Statistical Analysis.


C S 
oe a0 0% o% 0% o% 0° 
ro? “PMP Me Oe eo eo eo ef 


TO 
STATISTICAL Th } 
ony | 


Statistical Inference:
Estimation


15.1 INTRODUCTION

The process of drawing inferences about a population on the basis of information contained in a sample taken from the population is called Statistical Inference. Statistical inference is traditionally divided into two major areas: estimation of parameters and testing of hypothesis.

Estimation is a procedure by which we obtain an estimate of the true but unknown value of a population parameter by using the sample observations $X_1, X_2, \ldots, X_n$ from the population. For example, we may estimate the mean and the variance of a population by computing the mean and the variance of a sample drawn from the population.

Testing of hypothesis is a procedure which enables us to decide, on the basis of information obtained by sampling, whether to accept or reject any specified statement or hypothesis regarding the value of the parameter in a statistical problem.

We shall discuss estimation in this chapter and we shall deal with testing of hypothesis in the next chapter.


15.2 ESTIMATES AND ESTIMATORS

An estimate is a numerical value of the unknown parameter obtained by applying a rule or a formula, called an estimator, to a sample $X_1, X_2, \ldots, X_n$ of size n, taken from the population. In other words, an estimator stands for the rule or method that is used to estimate a parameter whereas an estimate stands for the numerical value obtained by substituting the sample observations in the rule or the formula. For instance, if $X_1, X_2, \ldots, X_n$ is a random sample of size n from a population with mean $\mu$, then $\bar{X} = \frac{1}{n}\sum X_i$ is an estimator of $\mu$ and $\bar{x}$, the numerical value of $\bar{X}$, is an estimate of $\mu$.


The symbol $\theta$ (the Greek letter theta) is customarily used to denote an unknown parameter that could be a mean, median, proportion or standard deviation, while an estimator of $\theta$ is commonly denoted by $\hat{\theta}$ (read theta hat), i.e., by placing the hat (^) over the symbol representing the parameter, or sometimes by T. It is of interest to note that an estimator is always a statistic which is a function of the sample observations and hence is a random variable, as the sample observations are likely to vary from sample to sample.


There are two categories of estimates: 


e 
value between 69" 
- interval estimate, 


other hand, 


"and 66", then the range 


€ unknown Parameter, 


ual to th 
e many p 


S to be noted that a point estimate 
e Population parameter as the random 
ossible samples which could be chosen 


If an estimate can be expressed as a sum of the weighted observations (i.e. as a linear combination), it is said to be a linear estimate. For example, $\bar{x}$ is a linear estimate of the parameter $\mu$ because it can be expressed as

$$\bar{x} = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n,$$

a linear combination of the values of X's and, in terms of weights, each observation is given a weight equal to $\frac{1}{n}$.

Example 15.1. A random sample of n=6 has the elements 6, 10, 13, 14, 18 and 20. Compute a point estimate of (i) the population mean, (ii) the population standard deviation, and (iii) the standard error of the mean.

(i) The sample mean is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{6+10+13+14+18+20}{6} = \frac{81}{6} = 13.5$$

Thus the point estimate of the population mean $\mu$ is 13.5 and $\bar{X}$ is the estimator.

(ii) The sample standard deviation is

$$S = \sqrt{\frac{\sum (x_i-\bar{x})^2}{n}} = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2} = \sqrt{\frac{1225}{6} - \left(\frac{81}{6}\right)^2}$$

$$= \sqrt{204.1667 - 182.25} = \sqrt{21.9167} = 4.68$$

Thus the point estimate of the population standard deviation $\sigma$ is 4.68 and S is the estimator.

(iii) When the sample size is less than 5% of the population size, the standard error of the mean is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. We use the sample standard deviation S to estimate the S.E. $\sigma_{\bar{x}}$. Thus

$$S_{\bar{x}} = \frac{S}{\sqrt{n}} = \frac{4.68}{\sqrt{6}} = \frac{4.68}{2.45} = 1.91,$$

where $S_{\bar{x}}$ is the estimator for $\sigma_{\bar{x}}$ and 1.91 is the point estimate of the standard error of the mean.
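The arithmetic of Example 15.1 can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the standard deviation uses the divisor n, as in the example.

```python
import math

# Sample from Example 15.1
sample = [6, 10, 13, 14, 18, 20]
n = len(sample)

# (i) point estimate of the population mean
mean = sum(sample) / n

# (ii) point estimate of the population standard deviation (divisor n)
variance_n = sum(x * x for x in sample) / n - mean ** 2
s = math.sqrt(variance_n)

# (iii) estimated standard error of the mean, S / sqrt(n)
se = s / math.sqrt(n)

print(round(mean, 2), round(s, 2), round(se, 2))
```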
. i a ry 
15.3.1 Criteria for Good Point Estimators. A point estimator is considered a good estimator if it satisfies various criteria. Four of these criteria are:

(i) Unbiasedness, (ii) consistency, (iii) efficiency, and (iv) sufficiency.

Now we discuss these properties in turn.


(i) Unbiasedness. An estimator is defined to be unbiased if the statistic used as an estimator has its expected value equal to the true value of the population parameter being estimated. In other words, let $\hat{\theta}$ be an estimator of a parameter $\theta$, then $\hat{\theta}$ will be called an unbiased estimator if $E(\hat{\theta}) = \theta$. If $E(\hat{\theta}) \neq \theta$, the statistic is said to be a biased estimator. The estimator is defined to be positively biased when $E(\hat{\theta}) > \theta$, and it is said to be negatively biased when $E(\hat{\theta}) < \theta$. Unbiasedness is a property that requires that the probability distribution of $\hat{\theta}$ be necessarily centred at the parameter $\theta$, irrespective of the value of n.

Let us consider the sample mean $\bar{X}$ as an estimator of the population mean $\mu$. Then we have $\theta = \mu$ and $\hat{\theta} = \bar{X} = \frac{1}{n}\sum X_i$.

Now
$$E(\hat{\theta}) = E(\bar{X}) = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}E[X_1 + X_2 + \cdots + X_n]$$
$$= \frac{1}{n}[E(X_1) + E(X_2) + \cdots + E(X_n)] = \frac{1}{n}(n\mu) = \mu$$
(since the $X_i$'s are drawn from a population with mean $\mu$).

Thus we see that the sample mean $\bar{X}$ is an unbiased estimator of the mean $\mu$ of any population.

The sample proportion $\hat{P}$ is also an unbiased estimator of the parameter p, as $E(\hat{P}) = E\left(\frac{X}{n}\right) = p$ (X is the number of successes in n trials). Similarly, the sample median is also an unbiased estimator of $\mu$ when the population is normally distributed. But the sample variance $S^2$ is a biased estimator of $\sigma^2$ as $E(S^2) \neq \sigma^2$.

Example 15.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from a population with a mean of $\mu$ and a variance of $\sigma^2$. Then show that the sample variance $S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is a biased estimator of $\sigma^2$.

The sample variance is defined by the relation
$$S^2 = \frac{1}{n}\sum (X_i - \bar{X})^2 = \frac{1}{n}\sum \left[(X_i - \mu) - (\bar{X} - \mu)\right]^2$$
$$= \frac{1}{n}\left[\sum (X_i - \mu)^2 - 2(\bar{X} - \mu)\sum (X_i - \mu) + n(\bar{X} - \mu)^2\right]$$
$$= \frac{1}{n}\left[\sum (X_i - \mu)^2 - 2n(\bar{X} - \mu)^2 + n(\bar{X} - \mu)^2\right] \qquad (\text{since } \sum (X_i - \mu) = n(\bar{X} - \mu))$$
$$= \frac{1}{n}\sum (X_i - \mu)^2 - (\bar{X} - \mu)^2$$

Taking expected values of both sides, we have
$$E(S^2) = E\left[\frac{1}{n}\sum (X_i - \mu)^2 - (\bar{X} - \mu)^2\right] = \frac{1}{n}\sum E(X_i - \mu)^2 - E(\bar{X} - \mu)^2$$
$$= \frac{1}{n}\sum \mathrm{Var}(X_i) - \mathrm{Var}(\bar{X}) = \frac{1}{n}(n\sigma^2) - \frac{\sigma^2}{n} = \left(\frac{n-1}{n}\right)\sigma^2$$

This shows that $S^2$ is a biased estimator of the variance $\sigma^2$. To get an unbiased estimator of $\sigma^2$, we should multiply the sample variance $S^2$ by $\frac{n}{n-1}$. Thus, writing $s^2 = \frac{n}{n-1}S^2$, we get
$$E(s^2) = \frac{n}{n-1}E(S^2) = \frac{n}{n-1}\cdot\frac{n-1}{n}\sigma^2 = \sigma^2$$

It then follows from this result that
$$s^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2$$
is an unbiased estimator of the population variance $\sigma^2$.


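Alternatively,
$$E(s^2) = E\left[\frac{1}{n-1}\sum (X_i - \bar{X})^2\right] = \frac{1}{n-1}E\left[\sum (X_i - \mu)^2 - n(\bar{X} - \mu)^2\right]$$
$$= \frac{1}{n-1}\left[\sum E(X_i - \mu)^2 - nE(\bar{X} - \mu)^2\right] = \frac{1}{n-1}\left[\sum \mathrm{Var}(X_i) - n\,\mathrm{Var}(\bar{X})\right]$$
$$= \frac{1}{n-1}\left[n\sigma^2 - n\cdot\frac{\sigma^2}{n}\right] = \frac{(n-1)\sigma^2}{n-1} = \sigma^2$$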


It is to be noted that, for large samples, $S^2$ becomes an approximately unbiased estimator of the population variance $\sigma^2$, as the amount of bias, which is $-\sigma^2/n$, becomes negligible.

Moreover, the quantity $\frac{1}{n}\sum (X_i - \mu)^2$ is an unbiased estimator of $\sigma^2$, but $\mu$ is usually not known.


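Example 15.3. If random samples of size 2 are drawn with replacement from a population consisting of the five numbers 2, 3, 6, 8 and 11; show, by finding all possible samples, that $\bar{X}$ and $s^2$ are unbiased estimators of $\mu$ and $\sigma^2$.

Population consists of 2, 3, 6, 8 and 11.

$$\mu = \frac{\sum X_i}{N} = \frac{2+3+6+8+11}{5} = \frac{30}{5} = 6$$

$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{(2-6)^2 + (3-6)^2 + (6-6)^2 + (8-6)^2 + (11-6)^2}{5} = \frac{54}{5} = 10.8$$

The possible samples of size 2 which can be drawn with replacement, together with the values of $\bar{x}$ and $s^2$, are given as follows:

[Table: the 25 possible samples with their values of $\bar{x}$ and $s^2$]

Now $\bar{X}$ is an unbiased estimator of $\mu$ if $E(\bar{X}) = \mu$. To find $E(\bar{X})$, we obtain the sampling distribution of $\bar{X}$ as below:

[Table: $\bar{x}$, probability $f(\bar{x})$, $\bar{x}f(\bar{x})$]

Thus $E(\bar{X}) = \sum \bar{x}f(\bar{x}) = \frac{150}{25} = 6 = \mu$.

Again $s^2$ is an unbiased estimator of the population variance $\sigma^2$ if $E(s^2) = \sigma^2$. To find $E(s^2)$, we have the sampling distribution of $s^2$ as follows:

[Table: $s^2$, probability $f(s^2)$, $s^2 f(s^2)$]

Thus $E(s^2) = \sum s^2 f(s^2) = \frac{270}{25} = 10.8 = \sigma^2$.

Hence the result.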
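The enumeration in Example 15.3 can be reproduced exactly. The sketch below lists all 25 ordered samples and averages $\bar{x}$, $s^2$ (divisor n-1) and $S^2$ (divisor n) using exact fractions; the function and variable names are illustrative choices, not part of the text.

```python
from fractions import Fraction
from itertools import product

population = [2, 3, 6, 8, 11]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

n = 2
# All N**n equally likely ordered samples drawn with replacement
samples = list(product(population, repeat=n))

def sample_mean(s):
    return Fraction(sum(s), len(s))

def sum_sq_dev(s):
    m = sample_mean(s)
    return sum((x - m) ** 2 for x in s)

E_xbar = sum(sample_mean(s) for s in samples) / len(samples)
E_s2 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)  # divisor n-1
E_S2 = sum(sum_sq_dev(s) / n for s in samples) / len(samples)        # divisor n

print(E_xbar, E_s2, E_S2)
```

The last figure illustrates Example 15.2: with divisor n the expectation is $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.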


Exam 
ple 15.4 
rectan < ->44 A rand 
gular distribution om sample of size n isd 
rawn from the 


‘for O if EX) = 6, We 


E 2 
both X¥ and X; are unbiased 


Now g 
in f= ‘i ad is estimator of |) 


x 


EaD = (8) 


Hence, 2¥ 
e, 2X is . 
an unbiased estim 
ator f 
org 


Example 15.5. Find an unbiased estimator for $\theta^2$ in a binomial distribution, where $\theta$ is the parameter p.

Let the random variable X have the binomial distribution with parameters n and $\theta$ (= p).

Then $E(X) = n\theta$ and $\mathrm{Var}(X) = n\theta(1-\theta) = n\theta - n\theta^2$.

But $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$, so that
$$E(X^2) = \mathrm{Var}(X) + [E(X)]^2 = n\theta - n\theta^2 + n^2\theta^2$$
or
$$E(X^2) - n\theta = n^2\theta^2 - n\theta^2 = n(n-1)\theta^2$$
Hence
$$\theta^2 = \frac{E(X^2) - E(X)}{n(n-1)} = \frac{E[X(X-1)]}{n(n-1)} = E\left[\frac{X(X-1)}{n(n-1)}\right],$$
so that $\frac{X(X-1)}{n(n-1)}$ is the unbiased estimator of $\theta^2$.
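The result of Example 15.5 can be verified by summing over the binomial distribution directly. The sketch below uses exact fractions. The values n = 8 and $\theta$ = 3/10 are arbitrary illustrative choices, and the comparison with the naive estimator $(X/n)^2$ is our own addition.

```python
from fractions import Fraction
from math import comb

def binomial_expectation(n, theta, func):
    """Exact E[func(X)] for X ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta ** x * (1 - theta) ** (n - x) * func(x)
        for x in range(n + 1)
    )

n = 8                      # hypothetical number of trials
theta = Fraction(3, 10)    # hypothetical success probability

# Estimator from Example 15.5: X(X-1) / [n(n-1)]
unbiased = binomial_expectation(n, theta, lambda x: Fraction(x * (x - 1), n * (n - 1)))

# Naive plug-in estimator (X/n)^2, shown only for comparison
naive = binomial_expectation(n, theta, lambda x: Fraction(x, n) ** 2)

print(unbiased, theta ** 2, naive)
```

The plug-in estimator overshoots by $\theta(1-\theta)/n$, which is exactly the term removed by using $X(X-1)$ in place of $X^2$.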
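Example 15.6. Let the sample linear regression be $Y_i = \alpha + \beta X_i + e_i$, where each $Y_i$ is normally distributed and the $X_i$'s are fixed. Then show that $\hat{\alpha}$ and $\hat{\beta}$ are unbiased estimators of the parameters $\alpha$ and $\beta$.

The sample regression coefficient may be expressed as
$$\hat{\beta} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{n\sum X(\alpha + \beta X + e) - \sum X \sum(\alpha + \beta X + e)}{n\sum X^2 - (\sum X)^2}$$
$$= \frac{n\alpha\sum X + n\beta\sum X^2 + n\sum Xe - n\alpha\sum X - \beta(\sum X)^2 - \sum X\sum e}{n\sum X^2 - (\sum X)^2}$$
$$= \beta + \frac{n\sum Xe - \sum X\sum e}{n\sum X^2 - (\sum X)^2} = \beta + \frac{\sum (X - \bar{X})e}{\sum (X - \bar{X})^2}$$

Taking expected values of both sides, we get
$$E(\hat{\beta}) = \beta + \frac{\sum (X - \bar{X})E(e)}{\sum (X - \bar{X})^2} \qquad (\text{X's are treated as constant w.r.t. this expectation})$$
$$= \beta \qquad [\text{since } E(e) = 0]$$

Thus $\hat{\beta}$ is an unbiased estimator for $\beta$.

Now
$$E(\hat{\alpha}) = E(\bar{Y} - \hat{\beta}\bar{X}) = E\left(\frac{\sum Y_i}{n}\right) - \bar{X}E(\hat{\beta}) = \frac{1}{n}\sum E(Y_i) - \bar{X}\beta$$
$$= \frac{\sum(\alpha + \beta X_i)}{n} - \bar{X}\beta \qquad (\text{since } E(Y_i) = \alpha + \beta X_i)$$
$$= \alpha + \beta\bar{X} - \beta\bar{X} = \alpha.$$

Hence, $\hat{\alpha}$ is also an unbiased estimator for $\alpha$.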
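The argument of Example 15.6 rests on the fact that $\hat{\beta}$ and $\hat{\alpha}$ are linear in the $Y_i$ with fixed weights. The following sketch makes those weights explicit and checks the expectations exactly. The X values and the "true" $\alpha$, $\beta$ are hypothetical numbers chosen only for illustration.

```python
from fractions import Fraction

x = [1, 2, 4, 7, 11]                 # hypothetical fixed X values
n = len(x)
xbar = Fraction(sum(x), n)
sxx = sum((xi - xbar) ** 2 for xi in x)

# beta_hat = sum(w_i * Y_i) and alpha_hat = sum(v_i * Y_i)
w = [(xi - xbar) / sxx for xi in x]
v = [Fraction(1, n) - xbar * wi for wi in w]

alpha, beta = Fraction(3), Fraction(-2)     # hypothetical true parameters
expected_y = [alpha + beta * xi for xi in x]  # E(Y_i) = alpha + beta * X_i

# Expectation is linear, so E(beta_hat) = sum(w_i * E(Y_i)), and similarly for alpha_hat
E_beta_hat = sum(wi * ey for wi, ey in zip(w, expected_y))
E_alpha_hat = sum(vi * ey for vi, ey in zip(v, expected_y))

print(E_beta_hat, E_alpha_hat)
```

Unbiasedness follows from the two identities $\sum w_i = 0$ and $\sum w_i X_i = 1$, whatever the values of $\alpha$ and $\beta$.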


(ii) Consistency. An estimator is said to be consistent if the statistic to be used as estimator becomes closer and closer to the population parameter being estimated as the sample size n increases. Stated a little differently, an estimator $\hat{\theta}$ is called a consistent estimator of $\theta$ if the probability that $\hat{\theta}$ becomes closer and closer to $\theta$ approaches unity with the increasing sample size. Symbolically, $\hat{\theta}$ is a consistent estimator of the parameter $\theta$ if, for an arbitrarily small positive quantity $\varepsilon$,

$$\lim_{n \to \infty} P\left[\,|\hat{\theta} - \theta| < \varepsilon\,\right] = 1.$$

The sample mean $\bar{X}$, which is an unbiased estimator of $\mu$, is a consistent estimator of the mean $\mu$. The sample proportion $\hat{P}$ is also a consistent estimator of the parameter p of a population that has a binomial distribution. The median is not a consistent estimator of $\mu$ when the population has a skewed distribution. The sample variance $S^2 = \frac{1}{n}\sum (X - \bar{X})^2$, though a biased estimator, is a consistent estimator of the population variance $\sigma^2$.

It may be noted that a statistic whose standard error decreases with the increasing sample size will be consistent. It should be noted that consistency is a large sample property.

To prove that an estimator is consistent, we may state a criterion, quite useful, as follows:

"Let $\hat{\theta}$ be an estimator of $\theta$ based on a sample of size n. Then $\hat{\theta}$ is a consistent estimator of $\theta$ if $E(\hat{\theta}) \to \theta$ and $\mathrm{Var}(\hat{\theta}) \to 0$ as $n \to \infty$."

Consider the sample mean, based on a random sample of size n from a population with mean $\mu$ and variance $\sigma^2$. We have

$$E(\bar{X}) = \mu \quad \text{and} \quad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$

Now as $n \to \infty$, $\frac{\sigma^2}{n}$ approaches 0, i.e. $\lim_{n \to \infty} \mathrm{Var}(\bar{X}) = \lim_{n \to \infty} \frac{\sigma^2}{n} = 0$.

Hence $\bar{X}$ is an unbiased and a consistent estimator of $\mu$.

Also, for the sample proportion $\hat{P}\left(= \frac{X}{n}\right)$ we have seen that

$$E(\hat{P}) = E\left(\frac{X}{n}\right) = p, \text{ i.e. } \hat{P} \text{ is an unbiased estimator for } p.$$

Now
$$\mathrm{Var}(\hat{P}) = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{1}{n^2}(npq) = \frac{pq}{n} \qquad (\text{since X is binomially distributed})$$

Since the value of $\frac{pq}{n}$ approaches zero as n tends to $\infty$, therefore $\hat{P}\left(= \frac{X}{n}\right)$ is a consistent estimator.

Hence, the sample proportion $\hat{P}\left(= \frac{X}{n}\right)$ is an unbiased and a consistent estimator for the population proportion p.
(iii) Efficiency. An unbiased estimator is defined to be efficient if the variance of its sampling distribution is smaller than that of the sampling distribution of any other unbiased estimator of the same parameter. In other words, suppose there are two unbiased estimators $T_1$ and $T_2$ of the same parameter $\theta$, then $T_1$ will be said to be a more efficient estimator than $T_2$ if $\mathrm{Var}(T_1) < \mathrm{Var}(T_2)$. The relative efficiency of $T_1$ compared to $T_2$ is given by the ratio $E_f = \frac{\mathrm{Var}(T_2)}{\mathrm{Var}(T_1)}$, which is greater than 1. It thus provides a criterion for comparing different unbiased estimators of a parameter.

Both the sample mean and the sample median, for a population that has a normal distribution, are unbiased and consistent estimators of $\mu$, but the variance of the sampling distribution of sample means $\left(\frac{\sigma^2}{n}\right)$ is smaller than that of the sampling distribution of sample medians $\left(\frac{\pi\sigma^2}{2n}\text{, for large } n\right)$:

$$E_f = \frac{\mathrm{Var}(\text{median})}{\mathrm{Var}(\bar{X})} = \frac{\pi\sigma^2}{2n} \cdot \frac{n}{\sigma^2} = \frac{\pi}{2} = 1.57 > 1.$$

Hence the sample mean is more efficient than the sample median as an estimator of $\mu$. The sample mean may therefore be preferred as an estimator. The efficiency of the sample mean relative to the sample median is 1.57 or 157%, which means that a sample mean calculated from a sample of size 100 can do the same job as the sample median calculated from a sample of size 157.


If an unbiased estimator $\hat{\theta}$ has smaller variance than any other unbiased estimator, it is called the minimum variance unbiased estimator of $\theta$. We generally prefer the unbiased estimator which has the minimum variance. An unbiased estimator having the minimum variance is called the best or most efficient estimator for $\theta$. It is of interest to note that the variance of $\hat{\theta}$ cannot become smaller than a certain lower bound, and a variance equal to this lower bound is called the minimum variance bound.

An estimator $\hat{\theta}$ that is linear, is unbiased and has minimum variance among all linear unbiased estimators of $\theta$, is called a best linear unbiased estimator.


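The mean square error (MSE) of an estimator $\hat{\theta}$ is the expected squared deviation of $\hat{\theta}$ from the true value $\theta$:

$$\mathrm{MSE}(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = E\left[\{\hat{\theta} - E(\hat{\theta})\} + \{E(\hat{\theta}) - \theta\}\right]^2$$
$$= E[\hat{\theta} - E(\hat{\theta})]^2 + [E(\hat{\theta}) - \theta]^2 + 2[E(\hat{\theta}) - \theta]\,E[\hat{\theta} - E(\hat{\theta})]$$
$$= \mathrm{Var}(\hat{\theta}) + (\text{bias})^2 \qquad (\text{since } E[\hat{\theta} - E(\hat{\theta})] = 0)$$

That is, the mean square error of $\hat{\theta}$ is equal to the variance of the estimator plus the square of the bias. If $E(\hat{\theta}) = \theta$, that is, in the case of an unbiased estimator, $E(\hat{\theta} - \theta)^2$ is equal to the variance of $\hat{\theta}$, and the variance and MSE would coincide. It is to be noted that MSE is an important criterion for comparing two estimators $T_1$ and $T_2$: we prefer the estimator which gives the smaller MSE about the parameter to be estimated.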
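The decomposition MSE = variance + (bias)$^2$ can be checked exactly on a small case. As an illustration only, the sketch below takes the biased sample variance $S^2$ (divisor n) for samples of size 2 drawn with replacement from the population 2, 3, 6, 8, 11 used in Example 15.3; the choice of estimator and the variable names are our own.

```python
from fractions import Fraction
from itertools import product

population = [2, 3, 6, 8, 11]
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def biased_variance(sample):
    """S^2 with divisor n."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)

# Value of the estimator on each of the 25 equally likely samples
values = [biased_variance(s) for s in product(population, repeat=2)]
k = len(values)

expectation = sum(values) / k
variance = sum((v - expectation) ** 2 for v in values) / k
bias = expectation - sigma2
mse = sum((v - sigma2) ** 2 for v in values) / k

print(variance, bias, mse)
```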


Example 15.7. Let $X_1$, $X_2$, $X_3$ and $X_4$ be a random sample of size 4 from a $N(\mu, \sigma^2)$ population. A statistician wishes to estimate the mean by using one of the following two estimators of the mean $\mu$:

$$T_1 = \frac{X_1 + X_2 + X_3 + X_4}{4} \quad (\text{the sample mean, } \bar{X})$$
$$T_2 = \frac{X_1 + 2X_2 + 3X_3 + X_4}{7} \quad (\text{the weighted mean, } \bar{X}_w)$$

Which estimator should be preferred?

To answer this question correctly, we need to compare the expected values and the variances of these two statistics.

Let us first see if the two estimators are unbiased.

Now
$$E(T_1) = E\left[\frac{X_1 + X_2 + X_3 + X_4}{4}\right] = \frac{1}{4}E[X_1 + X_2 + X_3 + X_4] = \frac{1}{4}(4\mu) = \mu,$$
i.e. $T_1$, the sample mean $\bar{X}$, is an unbiased estimator.

Again
$$E(T_2) = E\left[\frac{X_1 + 2X_2 + 3X_3 + X_4}{7}\right] = \frac{1}{7}[E(X_1) + 2E(X_2) + 3E(X_3) + E(X_4)]$$
$$= \frac{1}{7}[\mu + 2\mu + 3\mu + \mu] = \mu,$$
i.e. $T_2$, the weighted mean of the observations, is also an unbiased estimator of $\mu$. Thus we see that both means are unbiased estimators of $\mu$.

Next, we find their variances.

$$\mathrm{Var}(T_1) = \mathrm{Var}\left[\frac{X_1 + X_2 + X_3 + X_4}{4}\right] = \frac{1}{16}(4\sigma^2) = \frac{\sigma^2}{4}$$

$$\mathrm{Var}(T_2) = \mathrm{Var}\left[\frac{X_1 + 2X_2 + 3X_3 + X_4}{7}\right] = \mathrm{Var}\left(\frac{X_1}{7}\right) + \mathrm{Var}\left(\frac{2X_2}{7}\right) + \mathrm{Var}\left(\frac{3X_3}{7}\right) + \mathrm{Var}\left(\frac{X_4}{7}\right)$$
$$= \frac{1}{49}\mathrm{Var}(X_1) + \frac{4}{49}\mathrm{Var}(X_2) + \frac{9}{49}\mathrm{Var}(X_3) + \frac{1}{49}\mathrm{Var}(X_4) = \frac{15\sigma^2}{49}$$

$$\frac{\mathrm{Var}(T_2)}{\mathrm{Var}(T_1)} = \frac{15\sigma^2}{49} \times \frac{4}{\sigma^2} = \frac{60}{49},$$
which is greater than 1, showing that $\mathrm{Var}(T_1) < \mathrm{Var}(T_2)$. Thus $T_1$ is more efficient than $T_2$ as an estimator for $\mu$.

Hence $T_1 = \bar{X}$ is a better estimator of $\mu$ than $T_2 = \bar{X}_w$ and should be preferred.
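For any linear estimator $T = \sum w_i X_i$ of independent observations with common mean $\mu$ and variance $\sigma^2$, $E(T) = (\sum w_i)\mu$ and $\mathrm{Var}(T) = (\sum w_i^2)\sigma^2$. The sketch below applies this to the two estimators of Example 15.7; the helper name is an illustrative choice.

```python
from fractions import Fraction

def linear_estimator_coefficients(weights):
    """Return (a, b) such that E(T) = a*mu and Var(T) = b*sigma^2
    for T = sum(w_i * X_i) with independent, identically distributed X_i."""
    return sum(weights), sum(w * w for w in weights)

t1_weights = [Fraction(1, 4)] * 4                     # sample mean
t2_weights = [Fraction(k, 7) for k in (1, 2, 3, 1)]   # weighted mean

a1, b1 = linear_estimator_coefficients(t1_weights)
a2, b2 = linear_estimator_coefficients(t2_weights)

relative_efficiency = b2 / b1   # Var(T2) / Var(T1)

print(a1, b1, a2, b2, relative_efficiency)
```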

(iv) Sufficiency. An estimator is defined to be sufficient if the statistic used as estimator uses all the information that is contained in the sample. Any statistic that is not computed from all values in the sample is not a sufficient estimator. The sample mean $\bar{X}$ is a sufficient estimator of $\mu$. This implies that $\bar{X}$ contains all the information in the sample relative to the estimation of the population parameter $\mu$, and no other estimator such as the sample median, etc., calculated from the same sample can add any useful information concerning $\mu$. The sample proportion $\hat{P}$ is also a sufficient estimator of the population proportion p. The concept of a sufficient statistic, introduced by Sir R. A. Fisher (1890-1962), is of paramount importance in statistics.

Mathematically, let $X_1, X_2, \ldots, X_n$ be a random sample from a probability distribution $f(x; \theta)$ and let $f(x_1, x_2, \ldots, x_n; \theta)$ denote its joint distribution. Then $\hat{\theta}$ is a sufficient estimator for $\theta$ if and only if

$$f(x_1, x_2, \ldots, x_n; \theta) = g(\hat{\theta}; \theta)\, h(x_1, x_2, \ldots, x_n),$$

where $g(\hat{\theta}; \theta)$ depends on the sample values only through $\hat{\theta}$, and $h(x_1, x_2, \ldots, x_n)$ does not depend on $\theta$. This is known as the Neyman Factorization criterion for sufficiency.

Example 15.8. Let $X_1, X_2, \ldots, X_n$ be a random sample from the Poisson distribution $f(x; \theta)$. Show that $\bar{X}$ is a sufficient estimator for $\theta$.

Given the Poisson distribution

$$f(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}, \quad \text{for } x = 0, 1, 2, \ldots \quad (\theta \text{ is used for } \mu)$$

The joint distribution of the sample values is

$$f(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta) = \frac{e^{-n\theta}\,\theta^{\sum x_i}}{x_1!\, x_2! \cdots x_n!}$$

Now if we write

$$g\left(\sum x_i; \theta\right) = \frac{e^{-n\theta}(n\theta)^{\sum x_i}}{(\sum x_i)!}, \text{ a distribution of } \sum x_i \text{ that depends on } \theta, \text{ and}$$

$$h(x_1, x_2, \ldots, x_n) = \frac{(\sum x_i)!}{x_1!\, x_2! \cdots x_n!} \cdot \frac{1}{n^{\sum x_i}}, \text{ a function not depending on } \theta;$$

we see that the joint distribution has been split up into two functions, which satisfy the factorization criterion. This shows that $\sum X_i$ is a sufficient estimator for $\theta$.

Hence, $\bar{X}$ is a sufficient estimator for $\theta$, as any one-to-one function of $\sum X_i$ is also sufficient.
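The factorization used for the Poisson case can be confirmed numerically: for any sample, the joint probability should equal $g(\sum x_i; \theta)\,h(x_1, \ldots, x_n)$ for every $\theta$. The sketch below is a spot check, not a proof. The sample values and the grid of $\theta$ values are hypothetical, and the function names are our own.

```python
from math import exp, factorial, isclose, prod

def joint_pmf(xs, theta):
    """Product of Poisson(theta) probabilities for the observed values."""
    return prod(exp(-theta) * theta ** x / factorial(x) for x in xs)

def g(total, n, theta):
    """Poisson(n*theta) probability of the sample total; the only factor involving theta."""
    return exp(-n * theta) * (n * theta) ** total / factorial(total)

def h(xs):
    """Factor free of theta: (sum x_i)! / (x_1! ... x_n! * n**sum x_i)."""
    total, n = sum(xs), len(xs)
    return factorial(total) / (prod(factorial(x) for x in xs) * n ** total)

xs = [2, 0, 3, 1]   # hypothetical sample of counts
checks = [
    isclose(joint_pmf(xs, theta), g(sum(xs), len(xs), theta) * h(xs))
    for theta in (0.5, 1.3, 4.0)
]
print(checks)
```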
Example 15.9. Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and variance 1. Show that $\bar{X} = \frac{\sum X_i}{n}$, the mean of the sample, is a sufficient statistic for the parameter $\mu$.

The normal distribution with mean $\mu$ and $\sigma^2 = 1$ is given by

$$f(x; \mu) = \frac{1}{\sqrt{2\pi}}\, e^{-(x-\mu)^2/2}, \quad -\infty < x < \infty.$$

The joint distribution for the sample from the population is

$$f(x_1, x_2, \ldots, x_n; \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i-\mu)^2/2} = \frac{1}{(2\pi)^{n/2}}\, e^{-\sum (x_i-\mu)^2/2}$$

Now we have

$$\sum (x_i - \mu)^2 = \sum \left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2,$$

since the cross-product term vanishes. Therefore

$$f(x_1, x_2, \ldots, x_n; \mu) = e^{-n(\bar{x}-\mu)^2/2} \cdot \frac{1}{(2\pi)^{n/2}}\, e^{-\sum (x_i-\bar{x})^2/2} = g(\bar{x}, \mu)\, h(x_1, x_2, \ldots, x_n),$$

where $g(\bar{x}, \mu)$ depends on $\bar{x}$ (estimator) and $\mu$ (parameter), and $h(x_1, \ldots, x_n)$ depends on the $x_i$ and $\bar{x}$ but not on $\mu$.

Thus the joint distribution has been factored into two factors satisfying the factorization criterion.

Hence, $\bar{X}$ is a sufficient statistic for the parameter $\mu$.

15.3.2. Pooled Estimators from Two or More Samples. Often we need to estimate certain parameters by pooling (combining) the values from two or more random samples taken usually from the same population.

Let two random samples of sizes $n_1$ and $n_2$, with means $\bar{X}_1$ and $\bar{X}_2$, be taken from a population with mean $\mu$. Then the combined mean $\bar{X}_c = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$ is an unbiased estimator for $\mu$, since

$$E(\bar{X}_c) = E\left[\frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}\right] = \frac{1}{n_1 + n_2}E[n_1\bar{X}_1 + n_2\bar{X}_2]$$
$$= \frac{1}{n_1 + n_2}[n_1E(\bar{X}_1) + n_2E(\bar{X}_2)] = \frac{1}{n_1 + n_2}[n_1\mu + n_2\mu] = \mu.$$

In order to find an unbiased estimator for the population variance $\sigma^2$, based on the two samples with variances $S_1^2$ and $S_2^2$, we consider the following:

We have already seen that the statistic $\frac{nS^2}{n-1}$ is an unbiased estimator for $\sigma^2$. This implies that

$$E\left[\frac{n_1S_1^2}{n_1 - 1}\right] = \sigma^2 \text{ or } E(n_1S_1^2) = (n_1 - 1)\sigma^2, \text{ and}$$
$$E\left[\frac{n_2S_2^2}{n_2 - 1}\right] = \sigma^2 \text{ or } E(n_2S_2^2) = (n_2 - 1)\sigma^2.$$

Adding these two results, we get

$$E(n_1S_1^2) + E(n_2S_2^2) = (n_1 - 1)\sigma^2 + (n_2 - 1)\sigma^2$$
or
$$E(n_1S_1^2 + n_2S_2^2) = (n_1 + n_2 - 2)\sigma^2,$$
so that
$$E\left[\frac{n_1S_1^2 + n_2S_2^2}{n_1 + n_2 - 2}\right] = \sigma^2.$$

Hence, on the basis of two samples, the statistic $\frac{n_1S_1^2 + n_2S_2^2}{n_1 + n_2 - 2}$ is an unbiased estimator for $\sigma^2$. This statistic is generally denoted by $S_p^2$ and is called the pooled unbiased estimator for the variance $\sigma^2$.

These results can be easily extended to three or more random samples taken from the same population. For example, if we take three random samples, then the unbiased estimator for $\mu$ is

$$\bar{X}_c = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3}$$

and the pooled unbiased estimator for the variance $\sigma^2$ is

$$S_p^2 = \frac{n_1S_1^2 + n_2S_2^2 + n_3S_3^2}{n_1 + n_2 + n_3 - 3}.$$

Likewise, if we take two random samples of sizes $n_1$ and $n_2$ with sample proportions $\hat{P}_1$ and $\hat{P}_2$ from a binomial population with unknown proportion p of successes, then the unbiased estimator for p, based on pooled data, is given by

$$\hat{P}_c = \frac{n_1\hat{P}_1 + n_2\hat{P}_2}{n_1 + n_2},$$

since

$$E\left[\frac{n_1\hat{P}_1 + n_2\hat{P}_2}{n_1 + n_2}\right] = \frac{1}{n_1 + n_2}E[n_1\hat{P}_1 + n_2\hat{P}_2] = \frac{1}{n_1 + n_2}[n_1E(\hat{P}_1) + n_2E(\hat{P}_2)] = \frac{n_1p + n_2p}{n_1 + n_2} = p.$$

We now give an illustrative example.

Example 15.10. A sample of 40 observations from a population with unknown mean $\mu$ and unknown variance $\sigma^2$ gave $\sum x = 800$ and $\sum x^2 = 16052$. A second sample of 60 observations taken from the same population gave $\sum x = 1179$ and $\sum x^2 = 23285$. Using the data given by the two samples, find unbiased estimates of $\mu$ and $\sigma^2$.

We first compute the sample means and sample variances.

For sample 1:
$$\bar{x}_1 = \frac{\sum x}{n_1} = \frac{800}{40} = 20.0$$
$$S_1^2 = \frac{\sum x^2}{n_1} - \left(\frac{\sum x}{n_1}\right)^2 = \frac{16052}{40} - \left(\frac{800}{40}\right)^2 = 401.3 - 400.0 = 1.3$$

For sample 2:
$$\bar{x}_2 = \frac{\sum x}{n_2} = \frac{1179}{60} = 19.65$$
$$S_2^2 = \frac{\sum x^2}{n_2} - \left(\frac{\sum x}{n_2}\right)^2 = \frac{23285}{60} - \left(\frac{1179}{60}\right)^2 = 388.0833 - 386.1225 = 1.96$$

Now we calculate, using the information given by the two samples, the unbiased estimates of $\mu$ and $\sigma^2$.

An unbiased estimate for $\mu$ is the combined mean $\bar{x}_c$, where

$$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} = \frac{(40)(20.0) + (60)(19.65)}{40 + 60} = 19.79.$$

An unbiased estimate for $\sigma^2$ is $s_p^2$, where

$$s_p^2 = \frac{n_1S_1^2 + n_2S_2^2}{n_1 + n_2 - 2} = \frac{(40)(1.3) + (60)(1.96)}{40 + 60 - 2} = 1.73.$$

Hence, on the basis of two samples, an unbiased estimate of $\mu$ is 19.79, and an unbiased estimate of $\sigma^2$ is 1.73.

15.4 METHODS OF POINT ESTIMATION

A point estimator of a parameter can be obtained by several methods but we shall consider the following three methods only.

(i) The Method of Maximum Likelihood.
(ii) The Method of Moments.
(iii) The Method of Least-Squares.

These methods give estimates which may differ as the methods are based on different theories of estimation.

15.4.1. The Method of Maximum Likelihood. The method or principle of maximum likelihood, abbreviated ML, which is a very useful method of estimation, was introduced in 1922 by Sir Ronald A. Fisher (1890-1962). The principle and the underlying logic is "to consider every possible value that the parameter might have, and for each value compute the probability that the given sample would have occurred if that were the true value of the parameter. That value of the parameter for which the probability of a given sample is greatest, is chosen as an estimate." Estimates obtained by this method are called the maximum likelihood estimates (MLE). The ML estimators are, under fairly general conditions, consistent, efficient (at least for large samples) and sufficient, but not necessarily unbiased. The method of maximum likelihood is applicable to both discrete and continuous random variables.

To illustrate, let us consider a sample of n=10 rocks from a binomial population with unknown parameter p, where p denotes the proportion of rocks in a riverbed that are sedimentary in type. Suppose that X = 4, say, are found to be sedimentary in type. Now the problem is to estimate the unknown parameter p from the sample data.

For the given sample, both n (= 10) and X (= 4) are known and the only unknown is the value of p.

Using the principle of maximum likelihood, we consider all possible values of p (p = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1) and compute the probability for each possible value in order to choose that value of p for which the probability of the given sample is greatest.

By the binomial formula, the probability of X=4 when n=10 is

$$P(X = 4; p) = \binom{10}{4} p^4 (1-p)^6$$

Evaluating the expression for all possible values of p, we obtain

[Table: values of P(X = 4; p) for p = 0, 0.1, ..., 1]

Of all possible values of p, the probability of the actual sample data corresponding to p=0.4 is greatest. In other words, p=0.4 makes the observed sample most probable, and it is therefore taken as the maximum likelihood estimate.

Mathematically, the position of the maximum value is found by equating the derivative of

$$P(x; p) = \binom{10}{4} p^4 (1-p)^6$$

w.r.t. p to zero and solving for p:

$$\binom{10}{4}\left[4p^3(1-p)^6 - 6p^4(1-p)^5\right] = 0$$
or
$$2p^3(1-p)^5\left[2(1-p) - 3p\right] = 0.$$

Now either $p^3 = 0$, which gives p = 0;
or $(1-p)^5 = 0$, which gives p = 1;
or $2(1-p) - 3p = 0$, which gives $p = \frac{2}{5}$.

Note that the first two values give a minimum, and therefore our estimate is $\hat{p} = \frac{2}{5} = 0.4$.

The mathematical technique of finding ML estimators is presented below.

Let $X_1, X_2, \ldots, X_n$ be a random sample from a probability distribution $f(x; \theta)$, where $\theta$ is a single unknown parameter. Then the joint distribution for $X_1, X_2, \ldots, X_n$ is

$$f(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta),$$

as the observations in a random sample are statistically independent.

This joint distribution, regarded as a function of the parameter $\theta$, is called the Likelihood Function of the sample and is usually denoted by the symbol $L(\theta)$, i.e. $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$.

Now we wish to estimate the unknown parameter $\theta$ by that value which maximizes the likelihood function $L(\theta)$. Such a value, called the ML estimate, is obtained by solving for $\theta$ the equation $\frac{\partial L(\theta)}{\partial \theta} = 0$ for which $\frac{\partial^2 L(\theta)}{\partial \theta^2} < 0$, provided that $L(\theta)$ is a differentiable function of $\theta$.

In practice, it is often more convenient to find the maximum of the logarithm (ln) of $L(\theta)$ instead of $L(\theta)$ because of the fact that the maximum of $\ln L(\theta)$ occurs at the same point (value) as the maximum of $L(\theta)$. Then the ML estimator is the solution of $\frac{\partial}{\partial \theta}[\ln L(\theta)] = 0$ for which $\frac{\partial^2}{\partial \theta^2}[\ln L(\theta)] < 0$.

We have considered only one parameter $\theta$. This method may be extended to two or more parameters.

The following examples illustrate the application of the method of ML.
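Before the worked examples, the grid search described for the rock sample (n = 10, X = 4) can be sketched in a few lines. This is an illustrative sketch with our own variable names. It evaluates the binomial probability on the grid p = 0, 0.1, ..., 1 and picks the value with the largest likelihood.

```python
from math import comb

n, x = 10, 4                              # rock sample: 4 sedimentary out of 10
grid = [k / 10 for k in range(11)]        # candidate values of p

# Likelihood of the observed sample for each candidate p
likelihood = {p: comb(n, x) * p ** x * (1 - p) ** (n - x) for p in grid}

for p in grid:
    print(f"p = {p:.1f}   P(X = 4; p) = {likelihood[p]:.4f}")

p_hat = max(likelihood, key=likelihood.get)
print("ML estimate on the grid:", p_hat)
```

The grid maximum agrees with the calculus solution p = 2/5 obtained by differentiation.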


Example 15.11 Consider the geometric distribution

$$f(x) = pq^{x-1} \text{ for } x = 1, 2, 3, \ldots$$

Find the maximum likelihood estimator for p when
(i) only one experiment is performed,
(ii) n experiments are performed.

We find the maximum likelihood estimators for p in both situations as follows:

(i) Since only one experiment is being performed, the likelihood function would be

$$L(p) = f(x; p) = p(1-p)^{x-1}$$

The natural logarithm (ln) of the likelihood function is

$$\ln L(p) = \ln p + (x - 1)\ln(1-p)$$

Differentiating w.r.t. p, we obtain

$$\frac{\partial}{\partial p}[\ln L(p)] = \frac{1}{p} - \frac{x-1}{1-p}$$

Equating to zero, we get

$$\frac{1 - p - px + p}{p(1-p)} = 0$$

or $1 - px = 0$, which gives $\hat{p} = \frac{1}{x}$.

(ii) The likelihood function for n experiments would be

$$L(p) = f(x_1; p)\, f(x_2; p) \cdots f(x_n; p) = \prod_{i=1}^{n} p(1-p)^{x_i - 1} = p^n (1-p)^{\sum x_i - n}$$

The natural logarithm (ln) of the likelihood function is

$$\ln L(p) = n \ln p + \left(\sum x_i - n\right)\ln(1-p)$$

Differentiating w.r.t. p, we get

$$\frac{\partial}{\partial p}[\ln L(p)] = \frac{n}{p} - \frac{\sum x_i - n}{1-p}$$

Equating to zero and solving for p, we get

$$\hat{p} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}.$$

Hence, the MLE of p is $\frac{1}{\bar{x}}$.

Example 15.12. Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from the Bernoulli distribution

$$f(x) = p^x(1-p)^{1-x}, \quad x = 0, 1;\ 0 < p < 1.$$

Find the maximum likelihood estimator of p.

The likelihood function for a sample of size n with values $x_1, x_2, \ldots, x_n$ is

$$L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i} \qquad (\text{Here } \theta = p)$$

The natural logarithm of the likelihood function is

$$\ln L(p) = \sum x_i\,(\ln p) + \left(n - \sum x_i\right)[\ln(1-p)]$$

Differentiating w.r.t. p and equating to zero, we get

$$\frac{d}{dp}[\ln L(p)] = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0$$

Solving for p, we get

$$(1-p)\sum x_i - p\left(n - \sum x_i\right) = 0, \quad \text{so that} \quad \hat{p} = \frac{\sum x_i}{n} = \bar{x},$$

which is the MLE of p.


Example 15.13. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution, $f(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$, x = 0, 1, 2, ... Find the maximum likelihood estimator of $\lambda$. (P.U. B.A./B.Sc., 1987, 93)

The probability function of the Poisson distribution is

$$f(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad \text{for } x = 0, 1, 2, \ldots$$

The likelihood function for a sample of size n with values $x_1, x_2, \ldots, x_n$ is

$$L(\lambda) = f(x_1; \lambda)\, f(x_2; \lambda) \cdots f(x_n; \lambda) = \frac{e^{-n\lambda}\lambda^{\sum x_i}}{x_1!\, x_2! \cdots x_n!}$$

Taking the natural logarithm (ln) of the likelihood function, we obtain

$$\ln L(\lambda) = -n\lambda + \sum x_i\,(\ln \lambda) - \ln\left(x_1!\, x_2! \cdots x_n!\right)$$

Differentiating w.r.t. $\lambda$, we get

$$\frac{\partial \ln L(\lambda)}{\partial \lambda} = -n + \frac{\sum x_i}{\lambda}$$

Equating to zero and solving for $\lambda$, we obtain

$$\hat{\lambda} = \frac{\sum x_i}{n} = \bar{x}, \text{ the sample mean.}$$


Example 15.14. Assume that the random variable X has the exponential distribution
$$f(x; \theta) = \theta e^{-\theta x}, \quad x > 0,\ \theta > 0,$$
where $\theta$ is the parameter of the distribution. Use the method of ML to
estimate O if five observations of X were x, = 0.9, oe 2 
x4 = 0.3 and z; = 2.4. Xs = 94, 


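Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from the exponential distribution

$$f(x; \theta) = \theta e^{-\theta x}, \quad x > 0,\ \theta > 0.$$

Then the likelihood function for the given sample is

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) = \theta^n e^{-\theta \sum x_i}$$

The natural logarithm (ln) of the likelihood function is

$$\ln L(\theta) = n \ln \theta - \theta \sum x_i$$

Differentiating w.r.t. $\theta$ and equating to zero, we get

$$\frac{d}{d\theta}[\ln L(\theta)] = \frac{n}{\theta} - \sum x_i = 0$$

Solving, we get $\hat{\theta} = \frac{n}{\sum x_i}$, the MLE of $\theta$.

Now consider the given sample data. Here n = 5, so the estimate is $\hat{\theta} = \frac{5}{\sum x_i}$.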


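Example 15.15 Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with parameters $\mu$ and $\sigma^2$. Find the MLE for

(i) $\mu$ when the population variance $\sigma^2$ is known,
(ii) $\sigma^2$ when the population mean $\mu$ is known,
(iii) $\mu$ and $\sigma^2$ simultaneously.

The density function of a normal distribution is

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}$$

The likelihood function for a sample of size n with values $x_1, x_2, \ldots, x_n$ is

$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_i-\mu)^2/2\sigma^2} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\sum (x_i-\mu)^2/2\sigma^2}$$

The natural logarithm of the likelihood function is

$$\ln L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2$$

(i) Here $\theta = \mu$ as $\sigma^2$ is known. Therefore differentiating it w.r.t. $\mu$ and equating to zero, we get

$$\frac{\partial}{\partial \mu}[\ln L(\mu)] = -\frac{1}{2\sigma^2}\left[2\sum (x_i - \mu)(-1)\right] = 0,$$

which gives $\hat{\mu} = \frac{\sum x_i}{n}$.

Again $\frac{\partial^2}{\partial \mu^2}[\ln L(\mu)] = -\frac{n}{\sigma^2} < 0$.

Thus the MLE of $\mu$ is $\bar{x} = \frac{\sum x_i}{n}$, the sample mean.

(ii) Here $\theta = \sigma^2$ as $\mu$ is known.

Differentiating w.r.t. $\sigma^2$ and equating to zero, we get

$$\frac{\partial}{\partial \sigma^2}[\ln L(\sigma^2)] = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum (x_i - \mu)^2 = 0$$

Solving for $\sigma^2$, we find

$$\hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \mu)^2.$$

(iii) Here $\theta = (\mu, \sigma^2)$. To get joint ML estimators for $\mu$ and $\sigma^2$, we have to solve together the two equations

$$\frac{\partial}{\partial \mu}[\ln L(\mu, \sigma^2)] = 0 \quad \text{and} \quad \frac{\partial}{\partial \sigma^2}[\ln L(\mu, \sigma^2)] = 0$$

Now $\frac{\partial}{\partial \mu}[\ln L(\mu, \sigma^2)] = 0$ gives $-\frac{1}{2\sigma^2}\left[2\sum (x_i - \mu)(-1)\right] = 0$,

and $\frac{\partial}{\partial \sigma^2}[\ln L(\mu, \sigma^2)] = 0$ gives $-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum (x_i - \mu)^2 = 0$.

Solving these equations for $\mu$ and $\sigma^2$, we get

$$\hat{\mu} = \frac{\sum x_i}{n} = \bar{x} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2, \text{ where } \mu = \bar{x} \text{ is substituted.}$$

Thus the joint ML estimators for $\mu$ and $\sigma^2$ are the sample mean $\bar{X}$ and the sample variance $S^2$, which is not an unbiased estimator of $\sigma^2$.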
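As a rough numerical companion to Example 15.15, the sketch below evaluates the normal log-likelihood at the closed-form estimates and at a few nearby parameter values. The data are the six observations of Example 15.1, reused here purely for illustration; the perturbation sizes are arbitrary, and a handful of comparisons is a sanity check rather than a proof of a maximum.

```python
import math

def log_likelihood(xs, mu, sigma2):
    """ln L(mu, sigma^2) for a normal sample, as in Example 15.15."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [6, 10, 13, 14, 18, 20]     # observations borrowed from Example 15.1
n = len(xs)

mu_hat = sum(xs) / n                                   # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # S^2 with divisor n

best = log_likelihood(xs, mu_hat, sigma2_hat)

# The log-likelihood should drop when either parameter is moved away
perturbations = [(0.5, 0.0), (-0.5, 0.0), (0.0, 2.0), (0.0, -2.0)]
nearby = [log_likelihood(xs, mu_hat + d_mu, sigma2_hat + d_var)
          for d_mu, d_var in perturbations]

print(mu_hat, round(sigma2_hat, 4), round(best, 4))
```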


15.4.2. The Method of Moments. The method of moments, which is due to Karl Pearson (1857-1936), consists of calculating a few moments of the sample and equating them to the corresponding moments of the population.

Let $X_1, X_2, \ldots, X_n$ be a random sample of size n. Then the rth sample moment about zero is

$$m'_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$$

and the corresponding rth population moment is $\mu'_r$. We then match these moments and get as many equations as we need to solve for the unknown parameters.

The following examples illustrate the method:

Example 15.16. Let X be uniformly distributed on the interval $(0, \theta)$. Find an estimator of $\theta$ by the method of moments.

As the distribution has only one parameter, we therefore need to take only one equation.

The first sample moment about zero is $m'_1 = \frac{1}{n}\sum X_i = \bar{X}$. The first population moment about zero is

$$\mu'_1 = \int_0^{\theta} x\left(\frac{1}{\theta}\right) dx = \frac{1}{\theta}\left[\frac{x^2}{2}\right]_0^{\theta} = \frac{\theta}{2}$$

Matching these moments, we get

$$\bar{X} = \frac{\theta}{2} \quad \text{or} \quad \hat{\theta} = 2\bar{X}.$$

So, the moment estimator of $\theta$ is just twice the sample mean.
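Example 15.17. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a normal population with parameters $\mu$ and $\sigma^2$. Find estimators of these parameters by the method of moments.

Here we need two equations as there are two unknown parameters.

The first two sample moments about zero are

\[ m'_1 = \frac{1}{n}\sum X_i = \bar{X} \quad\text{and}\quad m'_2 = \frac{1}{n}\sum X_i^2. \]

The corresponding two moments of a normal distribution are

\[ \mu'_1 = \mu \quad\text{and}\quad \mu'_2 = \sigma^2 + \mu^2 \qquad (\because\ \sigma^2 = \mu'_2 - \mu'^2_1). \]

To get the desired estimators by the method of moments, we match them. Thus

\[ \mu = \frac{1}{n}\sum X_i \quad\text{and}\quad \sigma^2 + \mu^2 = \frac{1}{n}\sum X_i^2. \]

Solving, we get

\[ \hat{\mu} = \frac{1}{n}\sum X_i = \bar{X} \quad\text{and}\quad \hat{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2 = S^2 \]

as the moment estimators for $\mu$ and $\sigma^2$.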
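The last step of this solution uses the identity between the second moment about zero and the sum of squared deviations. A short worked version of that step, expanding the square and using $\frac{1}{n}\sum X_i = \bar{X}$, is:

\[ \frac{1}{n}\sum (X_i - \bar{X})^2 = \frac{1}{n}\sum X_i^2 - 2\bar{X}\cdot\frac{1}{n}\sum X_i + \bar{X}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2. \]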
15.4.3. The Method of Least-Squares. The method of Least-Squares, abbreviated LS, which is due to Gauss (1777-1855) and Markov (1856-1922), is based on the theory of linear estimation. An estimator found by minimizing the sum of squares of the deviations of the observed values from some hypothesized function is called the least squares estimator.

For our purposes, it will suffice to consider the linear relation $Y = \alpha + \beta X$, where $\alpha$ and $\beta$ are the unknown parameters.

Let $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, be a random sample of values of the random variable $Y$ with their associated values of $X$. Then the values of $\alpha$ and $\beta$ that minimize the sum of squares

\[ \sum \left[Y_i - (\alpha + \beta X_i)\right]^2 \]

are called the least-squares estimators of the parameters $\alpha$ and $\beta$. The method is discussed in detail elsewhere (see section 10.4.2 and chapter 10 of Part I).

15.5 CONFIDENCE INTERVAL

A confidence interval estimate of the unknown parameter $\theta$ is an interval computed from a random sample of $n$ values $X_1, X_2, \ldots, X_n$ with a statement of how confident (e.g. 90 per cent, 95 per cent or 99 per cent) we are that the interval contains the unknown parameter $\theta$. A confidence interval estimate is in the form $(L < \theta < U)$, where $L$ and $U$ depend upon the value of the statistic $\hat{\theta}$ of a random sample selected from the population and the sampling distribution of the statistic. To make an assertion that $\theta$ lies in the interval $(L, U)$, we may determine from the sampling distribution of $\hat{\theta}$ two values $L$ and $U$ such that $P(L < \theta < U)$ is equal to any specified probability, conventionally denoted by $1-\alpha$. If, for instance, $L$ and $U$ are two statistics such that for all $\theta$

\[ P(L < \theta < U) = 1 - \alpha, \quad\text{for } 0 < \alpha < 1, \]

then the probability of the interval $(L, U)$ containing the population parameter $\theta$ is $1-\alpha$. The interval $(L, U)$ is called a $100(1-\alpha)$ per cent confidence interval for the unknown parameter $\theta$, and the probability $(1-\alpha)$ associated with the interval estimate is called the confidence co-efficient or degree of confidence; that is, the interval has a probability $(1-\alpha)$ of containing the true value of the parameter. For example, if $\alpha = 0.05$ then the probability that the interval $(L, U)$ contains $\theta$ is 0.95.

The end points $L$ and $U$ that bound the confidence interval are called the lower and upper confidence limits for $\theta$. These limits are random variables; they can be different for different samples. The width of the confidence interval, i.e. the difference $U - L$, is called the precision of the estimate. The precision can be increased either by increasing the sample size or by decreasing the confidence level. The concept of a confidence interval was introduced in 1937 by the Polish-English-American statistician Jerzy Neyman (1894-1981).

Some of the commonly used confidence intervals are discussed in the sections that follow.

15.5.1. Confidence Interval Estimate of a Population Mean. To compute a confidence interval for the population mean $\mu$, we have to consider whether or not the population is normal, whether or not the population standard deviation is known, and whether the sample size is large or small. We discuss these different cases below.

(i) Normal Population with $\sigma$ known. Let a random sample $X_1, X_2, \ldots, X_n$ of size $n$ be drawn from a normal population with unknown mean $\mu$ and a known standard deviation $\sigma$. Then the sampling distribution of the mean $\bar{X}$ will be normal with a mean $\mu$ and a standard deviation $\sigma/\sqrt{n}$; and the variable

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]

will be exactly standard normal, no matter how small the sample size is. The normal distribution tells us that the probability that a value of $Z$ will fall in the interval from $-z_{\alpha/2}$ to $z_{\alpha/2}$ is equal to $1-\alpha$, where the area to the right of $z_{\alpha/2}$ is equal to $\alpha/2$. That is, we can state that

\[ P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha. \]

To put $\mu$ alone inside the inequalities within the brackets, we proceed as follows:

(a) We multiply all terms inside the brackets by $\sigma/\sqrt{n}$ and get

\[ -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

(b) We subtract $\bar{X}$ from each term and have

\[ -\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

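(c) We multiply all terms by $-1$ (remember, we reverse the direction of the inequality sign when we multiply both sides of the inequality by a negative number) and obtain

\[ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} > \mu > \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

which is equivalent to

\[ \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

(d) We substitute this result in the probability statement and get

\[ P\left[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right] = 1 - \alpha. \]

Hence, for a particular sample of size $n$ with mean $\bar{x}$, a $100(1-\alpha)$ per cent confidence interval for $\mu$ is given by

\[ \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

which may be expressed more compactly as

\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

If, for instance, we desire a 95% confidence interval, i.e. $(1-\alpha)100 = 95\%$, then from the table of areas under the normal curve, we find that the value of $z_{0.025}$ is 1.96, and the 95% confidence interval will be from

\[ \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \quad\text{to}\quad \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}. \]

This means that about 95% of the intervals found in this way will contain the parameter $\mu$.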


Example 15.18. The standard deviation of the amounts poured into bottles by an automatic filling machine is 1.8 ml (millilitre). The amounts of fill in a random sample of bottles, in ml, were 481, 479, 482, 480, 477, 478, 481 and 482. Suppose the population of amounts of fill is normal. Construct a 90% confidence interval for the mean amount in all bottles filled by the machine.

The 90% confidence interval for the mean amount in all bottles, $\mu$, is given by

\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

Here $\bar{x} = \frac{\sum x}{n} = \frac{3840}{8} = 480$, $\sigma = 1.8$, $n = 8$ and $z_{0.05} = 1.645$.

Substituting these values, we get

\[ 480 \pm 1.645\frac{1.8}{\sqrt{8}} \]

or $480 \pm (1.645)(0.636)$

or $480 \pm 1.05$ or 478.95 to 481.05.

Hence the 90% confidence interval for $\mu$ calculated from the given sample is (478.95, 481.05).
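The interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ can be computed directly, as in the sketch below. The function name `z_interval` is hypothetical, and the critical value comes from Python's `statistics.NormalDist` rather than from a printed table, so it is 1.6449 rather than the rounded 1.645. The inputs are the summary figures of the bottle-filling example.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, conf):
    """Two-sided confidence interval for mu when sigma is known."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Summary figures from the bottle-filling example: x-bar = 480, sigma = 1.8, n = 8
lower, upper = z_interval(480, 1.8, 8, conf=0.90)
print(round(lower, 2), round(upper, 2))
```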
Example 15.19. A confidence interval is constructed from a sample of size 25, for the mean of a normal population which has $\sigma = 50$. The limits for the interval are 110.2 and 135.8. What confidence co-efficient was used?

The $100(1-\alpha)\%$ confidence limits are given by

\[ \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad\text{and}\quad \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

Substituting the given values, we get

\[ \bar{x} - z_{\alpha/2}\frac{50}{\sqrt{25}} = 110.2, \qquad \bar{x} + z_{\alpha/2}\frac{50}{\sqrt{25}} = 135.8. \]

Subtracting, we get

\[ 2z_{\alpha/2}(10) = 135.8 - 110.2, \qquad z_{\alpha/2} = \frac{25.6}{20} = 1.28. \]

We know that $z_{\alpha/2}$ denotes that the area to the right of $z_{\alpha/2}$ is $\alpha/2$. From the area tables, we find that the area to the right of the value $z_{\alpha/2} = 1.28$ is about 0.1, implying that $\alpha/2 = 0.1$ or $\alpha = 0.2$.

Thus $1 - \alpha = 1 - 0.2 = 0.8$.

Hence an 80% confidence co-efficient was used.
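The reverse calculation in this example (from the limits back to the confidence co-efficient) can be sketched as follows. The function name `confidence_coefficient` is hypothetical; the normal area is taken from `statistics.NormalDist` instead of a table, so the result is close to, but not exactly, 0.80.

```python
from math import sqrt
from statistics import NormalDist


def confidence_coefficient(lower, upper, sigma, n):
    """Recover 1 - alpha from the limits of a z-interval with known sigma."""
    half_width = (upper - lower) / 2          # equals z_{alpha/2} * sigma / sqrt(n)
    z = half_width / (sigma / sqrt(n))        # solve for z_{alpha/2}
    return 2 * NormalDist().cdf(z) - 1        # area between -z and +z


conf = confidence_coefficient(110.2, 135.8, sigma=50, n=25)
print(round(conf, 3))
```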
(ii) Normal Population with $\sigma$ unknown. When a random sample $X_1, X_2, \ldots, X_n$ of size $n$ is drawn from a normal population with $\sigma$ unknown, we estimate $\sigma$ by the sample standard deviation $S$, which is then used in place of $\sigma$. If the sample size is sufficiently large ($n > 30$), then the Central Limit Theorem allows us to assume that the sampling distribution of $\bar{X}$ is approximately normal with a mean of $\mu$ and a standard deviation of $S/\sqrt{n}$, where $S$ is the sample standard deviation.

The probability expression for estimating $\mu$ then becomes

\[ P\left(\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right) = 1 - \alpha. \]

Thus a $100(1-\alpha)$ per cent confidence interval for $\mu$ is given by

\[ \bar{x} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}. \]

When $\sigma$ is unknown and the sample size is small ($n < 30$), the standardized variable $(\bar{X} - \mu)/(S/\sqrt{n})$ is no longer approximately standard normal; it follows a distribution known as the t-distribution. We shall discuss this case in chapter 18.


Example 15.20. The Punjab Highway Department is studying the traffic pattern on the Ravi bridge near Lahore. As part of the study, the department needs to estimate the average number of vehicles that cross the Ravi bridge each day. A random sample of 64 days gives $\bar{x} = 5410$ and $s = 680$. Find the 90 per cent confidence interval estimate for $\mu$, the average number of vehicles per day.

The 90% confidence interval for $\mu$ is

\[ \bar{x} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}, \]

where $\bar{x} = 5{,}410$, $s = 680$, $n = 64$ and $z_{0.05} = 1.645$.

Substituting these values, we get

\[ 5410 \pm (1.645)\left(\frac{680}{\sqrt{64}}\right) \]

or $5{,}410 \pm (1.645)(85)$

or $5{,}410 \pm 139.8$ or 5270.2 to 5549.8.

Thus the 90% confidence interval estimate for $\mu$ is (5270, 5550).


(iii) Non-normal Population with known or unknown $\sigma$ (Large samples). The Central Limit Theorem tells us that for large sample sizes, the sampling distribution of the mean $\bar{X}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, even when the population sampled is non-normal. That is, the random variable

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]

is approximately standard normal and consequently an approximate $100(1-\alpha)$ per cent confidence interval for $\mu$, for a non-normal population with $\sigma$ known, is given by

\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]

In case $\sigma$ is unknown and is estimated by the sample standard deviation $S$, the confidence interval estimate for $\mu$ becomes

\[ \bar{x} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}. \]

If we sample without replacement from a finite population of size $N$ and the sample size $n$ is greater than 5% of the population size, then the confidence interval estimate for $\mu$ is given by

\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}. \]

Example 15.21. A sample of size 100 from a population known to be non-normal yielded the sample values $\bar{x} = 182$ and $S^2 = 299$. Find an approximate 99% confidence interval for $\mu$.

The sample size ($n = 100$) is large enough to allow us to assume that the sampling distribution of $\bar{X}$ is approximately normal with mean $= \mu$. Therefore the approximate $100(1-\alpha)$ per cent confidence interval for $\mu$ is

\[ \bar{x} \pm z_{\alpha/2}\frac{S}{\sqrt{n}}. \]

Here $\bar{x} = 182$, $S = \sqrt{299} = 17.29$, $n = 100$ and $z_{0.005} = 2.58$, as the confidence co-efficient is 0.99.

Thus the 99% confidence interval for $\mu$ is

\[ 182 \pm 2.58\frac{17.29}{\sqrt{100}} \]

or $182 \pm (2.58)(1.729)$

or $182 \pm 4.46$ or 177.54 to 186.46.

Hence an approximate 99% confidence interval for $\mu$ is (177.54, 186.46).
Example 15.22. A random sample of size $n = 200$, selected without replacement from a population of size $N = 1000$ with $\sigma = 1.08$, showed that $\bar{x} = 69.2$. Construct a 95 per cent confidence interval for the true mean of the population.

As the sample is selected without replacement and its size $n = 200$ is greater than 5% of the population size $N = 1{,}000$, we therefore use the following expression to calculate the desired 95% confidence interval for $\mu$:

\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}. \]

Substituting the values, we get

\[ 69.2 \pm (1.96)\frac{1.08}{\sqrt{200}}\sqrt{\frac{1000-200}{1000-1}} \]

or $69.2 \pm (1.96)\left(\frac{1.08}{14.14}\right)(0.8949)$

or $69.2 \pm (1.96)(0.068)$

or $69.2 \pm 0.13$ or 69.07 to 69.33.

Hence the 95% confidence interval for $\mu$ is (69.07, 69.33).

Example 15.23. A sample of readings from a normal population with unknown mean $\mu$ and unknown variance $\sigma^2$ gave the following data: $n_1 = 80$, $\sum x = 1408.3$, $\sum x^2 = 24792.63$. A second sample of readings gave $n_2 = 72$, $\sum x = 1267.2$, $\sum x^2 = 22536$. Combine the two samples to give estimates of $\mu$ and $\sigma^2$ and hence give the approximate 90% confidence interval for $\mu$. (I.U., M.Sc. 1995)

First we calculate the sample means and the sample variances.

For sample 1:

\[ \bar{x}_1 = \frac{\sum x}{n_1} = \frac{1408.3}{80} = 17.60, \]

\[ S_1^2 = \frac{\sum x^2}{n_1} - \left(\frac{\sum x}{n_1}\right)^2 = \frac{24792.63}{80} - \left(\frac{1408.3}{80}\right)^2 = 309.9079 - 309.8920 = 0.0159. \]

For sample 2:

\[ \bar{x}_2 = \frac{\sum x}{n_2} = \frac{1267.2}{72} = 17.60, \]

\[ S_2^2 = \frac{\sum x^2}{n_2} - \left(\frac{\sum x}{n_2}\right)^2 = \frac{22536}{72} - \left(\frac{1267.2}{72}\right)^2 = 313 - 309.76 = 3.24. \]

Next we calculate the pooled estimates $\bar{x}_p$ and $s_p^2$ as below:

\[ \bar{x}_p = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} = \frac{80(17.60) + 72(17.60)}{80 + 72} = 17.60, \]

\[ s_p^2 = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2} = \frac{80(0.0159) + 72(3.24)}{80 + 72 - 2} = \frac{234.552}{150} = 1.5637. \]

A 90% confidence interval for $\mu$ based on the combined samples is

\[ \bar{x}_p \pm z_{\alpha/2}\sqrt{\frac{s_p^2}{n_1 + n_2}} \]

or $17.60 \pm 1.645\sqrt{\frac{1.5637}{152}}$

or $17.60 \pm 1.645\,(0.1014)$

or $17.60 \pm 0.17$ or 17.43 to 17.77.

Hence, the 90% confidence interval for $\mu$, on the basis of two samples, is (17.43, 17.77).
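The pooled two-sample calculation above can be reproduced from the raw sums, as in the sketch below. The helper name `summarize` is hypothetical. The code keeps full precision throughout, whereas the text rounds the means to 17.60, so the limits may differ from (17.43, 17.77) by about 0.01.

```python
from math import sqrt
from statistics import NormalDist


def summarize(n, sum_x, sum_x2):
    """Mean and divisor-n variance from n, sum of x and sum of x squared."""
    mean = sum_x / n
    var = sum_x2 / n - mean ** 2
    return mean, var


n1, n2 = 80, 72
mean1, var1 = summarize(n1, 1408.3, 24792.63)
mean2, var2 = summarize(n2, 1267.2, 22536)

# Pooled estimates of mu and sigma^2, as defined in the worked solution
pooled_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2 - 2)

# Approximate 90% interval for mu based on all n1 + n2 readings
z = NormalDist().inv_cdf(0.95)
half_width = z * sqrt(pooled_var / (n1 + n2))
lower, upper = pooled_mean - half_width, pooled_mean + half_width
print(round(pooled_mean, 2), round(pooled_var, 4), round(lower, 2), round(upper, 2))
```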


15.5.2. Interpretation of a Confidence Interval. A $100(1-\alpha)$ per cent confidence interval for $\mu$ is based on the statement

\[ P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha. \]

It is to be emphasized that in this expression, $\mu$ is constant and it is the end points (i.e. limits) of the interval which are random variables. Therefore, after computing the confidence interval for a particular sample, it is erroneous to say that

\[ P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha, \]

because in this expression no random variable appears, whereas a probability statement is made about random variables. This means that the probability measure cannot be attached to the stated interval. If the statement is correct, in the sense that it includes $\mu$, the probability is one, and if it is incorrect, the probability is zero. In neither case is the probability $(1-\alpha)$. However, of all possible intervals, $100(1-\alpha)$ per cent will include $\mu$ and $100\alpha$ per cent of the intervals will not include $\mu$ in the long run.

To be specific, suppose an actual sample of size $n = 16$ is drawn from a normal population with an unknown mean $\mu$ and a known standard deviation $\sigma = 2$. Let the sample mean $\bar{x} = 6.2$ and the confidence co-efficient $(1-\alpha) = 0.95$. Substituting these values, we get

\[ P\left[6.2 - 1.96\left(\frac{2}{\sqrt{16}}\right) < \mu < 6.2 + 1.96\left(\frac{2}{\sqrt{16}}\right)\right] = 0.95 \]

or $P(5.22 < \mu < 7.18) = 0.95$.

This probability statement is erroneous because $\mu$ is not a random variable. The parameter $\mu$ is either in the interval or it is not. If $\mu$ lies in the interval (5.22, 7.18), then $P(5.22 < \mu < 7.18) = 1$ and if it does not lie in the interval (5.22, 7.18), then $P(5.22 < \mu < 7.18) = 0$. We can say

\[ P\left[\bar{X} - 1.96\left(\frac{2}{\sqrt{16}}\right) < \mu < \bar{X} + 1.96\left(\frac{2}{\sqrt{16}}\right)\right] = 0.95 \]

or $P(\bar{X} - 0.98 < \mu < \bar{X} + 0.98) = 0.95$.

The interval $\bar{X} \pm 0.98$ is a random variable because $\bar{X}$ is not any particular numerical value but takes different values in different samples. It is therefore correct to say that

\[ P(\bar{X} - 0.98 < \mu < \bar{X} + 0.98) = 0.95, \]

meaning thereby that "the probability that the random interval $(\bar{X} - 0.98,\ \bar{X} + 0.98)$ covers the true value of $\mu$ is 0.95". In other words, in repeated samples of size 16 from a normal population with standard deviation 2, the interval $(\bar{X} - 0.98,\ \bar{X} + 0.98)$ will contain the true unknown value of $\mu$ about 95 per cent of the time.

To illustrate, let us draw 100 samples of 16 observations each, calculate $\bar{x}$ for each sample and hence find the interval $(\bar{x}_i - 0.98,\ \bar{x}_i + 0.98)$ for each sample. These interval estimates, based on 100 possible values of the random variable $\bar{X}$, are shown in the figure on the next page.

[Figure: interval estimates $(\bar{x}_i - 0.98,\ \bar{x}_i + 0.98)$ from the 100 samples, drawn against the true value of $\mu$; an interval that misses it is labelled "This interval does not cover $\mu$".]

On the average, about 95 of these 100 intervals will contain the true value of $\mu$. Thus we see that having taken our sample and found $\bar{x} = 6.2$, we cannot say that

\[ P[6.2 - 0.98 \le \mu \le 6.2 + 0.98] = 0.95. \]

Rather, we say that we are 95 per cent confident that the true population mean $\mu$ will be in the interval $(6.2 - 0.98,\ 6.2 + 0.98)$.
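The long-run reading of the confidence co-efficient can be checked by simulation. The sketch below is an illustration under stated assumptions: the true mean `mu = 6.0` is a hypothetical value chosen for the experiment (in practice it is unknown), the function name `coverage` is invented here, and 10,000 repetitions are used instead of 100 so that the observed proportion is stable.

```python
import random
from math import sqrt


def coverage(mu, sigma, n, z, reps, seed=0):
    """Proportion of intervals x-bar +/- z*sigma/sqrt(n) that cover mu."""
    rng = random.Random(seed)           # fixed seed for a repeatable run
    half_width = z * sigma / sqrt(n)    # 0.98 when sigma = 2, n = 16, z = 1.96
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps


# sigma = 2 and n = 16 as in the text; mu = 6.0 is a hypothetical true mean
rate = coverage(mu=6.0, sigma=2.0, n=16, z=1.96, reps=10_000)
print(rate)
```

Each simulated interval either covers the fixed value of `mu` or does not; it is only the proportion over many samples that settles near 0.95.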


15.5.3. Confidence Interval for the difference between two means. To construct the confidence interval for the difference between two means, $\mu_1 - \mu_2$, the following three cases are to be considered:

(i) Both the populations are normal with known standard deviations.
(ii) Both the populations are normal with unknown standard deviations.
(iii) Both the populations are non-normal, in which case both sample sizes are necessarily large.

(i) Normal Populations with known Standard Deviations. Suppose we have two normal populations. Population 1 has an unknown mean $\mu_1$ and a known standard deviation $\sigma_1$, and population 2 has an unknown mean $\mu_2$ and a known standard deviation $\sigma_2$. Independent samples of sizes $n_1$ and $n_2$ are taken from the populations and their means calculated. Let the sample means be denoted by $\bar{x}_1$ and $\bar{x}_2$. Then the sampling distribution of the difference, $\bar{X}_1 - \bar{X}_2$, is normal with a mean $\mu_1 - \mu_2$ and a standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. In other words, the variable

\[ Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]

is exactly standard normal, no matter how small the sample sizes are. We can therefore make the following probability statement:

\[ P\left[-z_{\alpha/2} < \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} < z_{\alpha/2}\right] = 1 - \alpha. \]

Multiplying each term inside the bracket by $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, subtracting $(\bar{X}_1 - \bar{X}_2)$ and then multiplying by $-1$ (inequality signs reversed), we get

\[ P\left[(\bar{X}_1 - \bar{X}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right] = 1 - \alpha. \]

Hence the $100(1-\alpha)$ per cent confidence interval for $\mu_1 - \mu_2$, for particular samples obtained, is

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]


Example 15.24. Two independent samples of 100 mechanists and 100 carpenters were taken to estimate the difference between the average weekly wages of the two categories of workers. The relevant data are given below:

                          Mechanists    Carpenters
    Sample mean wage         345           340
    Population variance      196           204

Find the 95% and the 99% confidence limits for the true difference between the average wages. (P.U., B.A./B.Sc. 1986)

The 95% confidence limits for $\mu_1 - \mu_2$ are given by

\[ (\bar{x}_1 - \bar{x}_2) \pm 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

Substituting the values, we get

\[ (345 - 340) \pm 1.96\sqrt{\frac{196}{100} + \frac{204}{100}} \]

or $5 \pm (1.96)(2)$

or $5 \pm 3.92$ or 1.08 to 8.92.

Hence the 95% confidence limits for the true difference between the average weekly wages for mechanists and carpenters are 1.08 to 8.92.

Similarly, the 99% confidence limits for $\mu_1 - \mu_2$ are

\[ (\bar{x}_1 - \bar{x}_2) \pm 2.58\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

or $(345 - 340) \pm 2.58\sqrt{\frac{196}{100} + \frac{204}{100}}$

or $5 \pm 2.58(2)$

or $5 \pm 5.16$ or $-0.16$ to 10.16.

Thus the 99% confidence limits for the true difference between the average weekly wages are $-0.16$ to 10.16.
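The two sets of limits in this example follow from one formula with different critical values, which the sketch below makes explicit. The function name `two_mean_z_interval` is hypothetical. Because `statistics.NormalDist` gives $z_{0.005} = 2.576$ rather than the rounded 2.58, the 99% limits come out about 0.01 narrower than in the text.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_interval(xbar1, xbar2, var1, var2, n1, n2, conf):
    """Confidence interval for mu1 - mu2 with known population variances."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    std_error = sqrt(var1 / n1 + var2 / n2)
    diff = xbar1 - xbar2
    return diff - z * std_error, diff + z * std_error


# Figures from the wages example: means 345 and 340, variances 196 and 204
ci95 = two_mean_z_interval(345, 340, 196, 204, 100, 100, conf=0.95)
ci99 = two_mean_z_interval(345, 340, 196, 204, 100, 100, conf=0.99)
print([round(v, 2) for v in ci95], [round(v, 2) for v in ci99])
```

The 99% interval contains zero while the 95% interval does not, which is the pattern shown by the limits in the worked example.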
(ii) Normal Populations with unknown Standard Deviations. When the independent samples of sizes $n_1$ and $n_2$ are drawn from normal populations with unknown standard deviations, we estimate them by the respective sample standard deviations. If the sample sizes are sufficiently large, then we can assume that the sampling distribution of the difference, $\bar{X}_1 - \bar{X}_2$, is approximately normal with mean $\mu_1 - \mu_2$ and standard deviation $\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$. Hence the $100(1-\alpha)$ per cent confidence interval estimate for $\mu_1 - \mu_2$, for particular samples obtained, is

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}. \]

If, on the other hand, the sample sizes are small and the populations have unknown equal standard deviations, we use Student's t-distribution to construct the confidence interval. The t-distribution shall be discussed in chapter 18.
Example 15.25. A test in statistics was given to 50 girls and 75 boys. The girls made an average grade of 76 with a standard deviation of 6, while the boys made an average grade of 82 with a standard deviation of 8. Find a 96% confidence interval for the difference $\mu_1 - \mu_2$, where $\mu_1$ is the mean score of all boys and $\mu_2$ is the mean score of all girls who might take this test. (P.U., B.A./B.Sc. 1984)

As the sample sizes are sufficiently large ($n_1, n_2 > 30$), we can therefore use the sample standard deviations $S_1$ and $S_2$ in place of the population standard deviations $\sigma_1$ and $\sigma_2$. Assuming the populations to be normally distributed, the 96% confidence interval for $\mu_1 - \mu_2$ is

\[ (\bar{x}_1 - \bar{x}_2) \pm 2.054\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}, \]

where $z_{\alpha/2}$, i.e. $z_{0.02} = 2.054$.

Substituting the given values, we get

\[ (82 - 76) \pm 2.054\sqrt{\frac{8^2}{75} + \frac{6^2}{50}} \]

or $6 \pm 2.054\sqrt{\frac{64}{75} + \frac{36}{50}}$

or $6 \pm (2.054)(1.254)$

or $6 \pm 2.58$ or 3.42 to 8.58.

Hence the desired 96% confidence interval for $\mu_1 - \mu_2$, calculated from the given values, is (3.42, 8.58).

(iii) Non-normal Populations. If the sample sizes are sufficiently large, the Central Limit Theorem tells us that the sampling distribution of the difference $\bar{X}_1 - \bar{X}_2$ will be approximately normal even though the populations are non-normal. An approximate $100(1-\alpha)$ per cent confidence interval for $\mu_1 - \mu_2$ when the population standard deviations are known is

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

If the population standard deviations are unknown, then they are replaced by the sample standard deviations. The approximate $100(1-\alpha)$ per cent confidence interval then becomes

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}, \]

where $S_1$ and $S_2$ are the sample standard deviations.
15.5.4. Confidence Interval for Population Proportion (Large sample). Let a random sample of size $n$ ($n > 30$) be drawn from a population with an unknown proportion of successes $p$, and let $\hat{P}$ denote the sample proportion of successes. Now, we wish to estimate $p$ by an interval computed from the sample data.

We know that the sampling distribution of a sample proportion $\hat{P}$ is approximately normal with a mean of $p$ and a standard deviation of $\sqrt{pq/n}$, where $q = 1 - p$, if the sample size is sufficiently large and $p$ is not too close to zero or 1. Thus the random variable

\[ Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \]

will be approximately $N(0, 1)$ and hence we can state that

\[ P\left[-z_{\alpha/2} < \frac{\hat{P} - p}{\sqrt{pq/n}} < z_{\alpha/2}\right] = 1 - \alpha. \]

Multiplying all terms of the inequality by $\sqrt{pq/n}$, subtracting $\hat{P}$ from each term and then multiplying by $-1$ (reversing the direction of the inequality signs), we get

\[ P\left[\hat{P} - z_{\alpha/2}\sqrt{\frac{pq}{n}} < p < \hat{P} + z_{\alpha/2}\sqrt{\frac{pq}{n}}\right] = 1 - \alpha. \]

Thus, for a particular sample of size $n$ ($n > 30$), an approximate $100(1-\alpha)$ per cent confidence interval for $p$ is given by

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{pq}{n}}. \]

But we are faced with the difficulty that the standard deviation of the sample proportion involves the unknown $p$. For large samples, this difficulty is overcome by using the sample proportion $\hat{p}$ in place of $p$. Hence for a particular sample, the approximate $100(1-\alpha)$ per cent confidence interval for the population proportion $p$ will become

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}. \]

Example 15.26. In a certain large city, a random sample of 400 families contacted by a local TV station showed that 275 owned colour TV sets. Find an approximate 90 per cent confidence interval on the true proportion of all families living in the city who own colour TV sets.

An approximate 90% confidence interval for $p$ is

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}. \]

Now, the sample proportion $\hat{p} = \frac{275}{400} = 0.69$, so $\hat{q} = 1 - 0.69 = 0.31$.

The degree of confidence is 90%, therefore $z_{0.05} = 1.645$. Substituting these values, we get

\[ 0.69 \pm (1.645)\sqrt{\frac{(0.69)(0.31)}{400}} \]

or $0.69 \pm (1.645)(0.023)$

or $0.69 \pm 0.038$ or 0.652 to 0.728.

Hence the approximate 90% confidence interval for the true proportion that owns colour TV sets is (0.652, 0.728).
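The large-sample interval for a proportion is sketched below with the counts from the colour-TV example. The function name `proportion_interval` is hypothetical. The code uses the unrounded proportion 275/400 = 0.6875, whereas the worked solution rounds it to 0.69, so the limits differ from (0.652, 0.728) in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, conf):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    std_error = sqrt(p_hat * (1 - p_hat) / n)      # uses p-hat in place of p
    return p_hat - z * std_error, p_hat + z * std_error


# 275 of 400 sampled families own a colour TV set
lower, upper = proportion_interval(275, 400, conf=0.90)
print(round(lower, 3), round(upper, 3))
```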


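15.5.5. Confidence Interval for the Difference between two Proportions. Let $\hat{P}_1$ and $\hat{P}_2$ be the proportions of successes in two independent random samples of sizes $n_1$ and $n_2$ drawn from populations with proportions $p_1$ and $p_2$. The sampling distribution of $\hat{P}_1 - \hat{P}_2$ is approximately normal with a mean of $p_1 - p_2$ and a standard deviation of $\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$, when the sample sizes are sufficiently large. Hence

\[ P\left[-z_{\alpha/2} < \frac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} < z_{\alpha/2}\right] = 1 - \alpha. \]

We see that the standard error of $\hat{P}_1 - \hat{P}_2$ involves the unknown parameters $p_1$ and $p_2$. We therefore replace $p_1$ and $p_2$ with their sample estimates $\hat{p}_1$ and $\hat{p}_2$. Hence for sufficiently large samples, an approximate $100(1-\alpha)$ per cent confidence interval for $p_1 - p_2$ is given by

\[ (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}. \]

Example 15.27. In a poll of college students in a large state university, 300 of 400 students living in dormitories approved a certain course of action, whereas 200 of 300 students not living in dormitories approved it. Estimate the difference in the proportions favouring the course of action, with a 90% confidence interval. (P.U., B.A./B.Sc. 1982)

Let $\hat{p}_1$ and $\hat{p}_2$ be the observed proportions in the first and second sample respectively. Then

\[ \hat{p}_1 = \frac{300}{400} = 0.75 \quad\text{and}\quad \hat{p}_2 = \frac{200}{300} = 0.67. \]

Difference in proportions $= \hat{p}_1 - \hat{p}_2 = 0.75 - 0.67 = 0.08$.

The degree of confidence is 0.90, therefore $z_{0.05} = 1.645$.

The 90% confidence interval for $p_1 - p_2$ is

\[ (\hat{p}_1 - \hat{p}_2) \pm (1.645)\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \]

or $0.08 \pm (1.645)\sqrt{\frac{(0.75)(0.25)}{400} + \frac{(0.67)(0.33)}{300}}$

or $0.08 \pm (1.645)\sqrt{0.000469 + 0.000737}$

or $0.08 \pm (1.645)(0.0347)$

or $0.08 \pm 0.057$ or 0.023 to 0.137.

Hence the 90 per cent confidence interval for $p_1 - p_2$ is (0.023, 0.137).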
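The dormitory poll calculation can be sketched in the same way. The function name `two_proportion_interval` is hypothetical. The code keeps $\hat{p}_2 = 200/300$ unrounded, while the worked solution rounds it to 0.67, so the computed limits sit about 0.003 above (0.023, 0.137).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(x1, n1, x2, n2, conf):
    """Large-sample confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    std_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * std_error, diff + z * std_error


# 300 of 400 dormitory students and 200 of 300 other students approved
lower, upper = two_proportion_interval(300, 400, 200, 300, conf=0.90)
print(round(lower, 3), round(upper, 3))
```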


15.5.6. One-sided Confidence Interval. The confidence interval discussed so far for the parameter $\theta$ specifies both a lower and an upper limit. It may therefore be more appropriate to call it a two-sided $100(1-\alpha)$ per cent confidence interval. Occasionally, we want only an upper or a lower limit, but not both, for the parameter. In a one-sided confidence interval the entire $\alpha$ area will be located at one end of the distribution. Thus a one-sided $100(1-\alpha)$ per cent lower confidence interval for $\theta$ is given by the interval $L < \theta$, where we choose the lower confidence limit $L$ such that $P(L < \theta) = 1 - \alpha$.

Similarly, a one-sided $100(1-\alpha)$ per cent upper confidence interval for $\theta$ is given by the interval $\theta < U$, where we choose the upper confidence limit $U$ such that $P(\theta < U) = 1 - \alpha$.

In case of a normal distribution with $\sigma$ known, the probability statement for the one-sided $100(1-\alpha)$ per cent confidence interval for $\mu$ would be

\[ P\left(\bar{X} - z_{\alpha}\frac{\sigma}{\sqrt{n}} < \mu\right) = 1 - \alpha \quad\text{or}\quad P\left(\mu < \bar{X} + z_{\alpha}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha. \]

Thus in a one-sided confidence interval, a $100(1-\alpha)$ per cent lower confidence limit would be $\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$, and a $100(1-\alpha)$ per cent upper confidence limit would be $\bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$.

One-sided confidence intervals for the difference between two means $\mu_1 - \mu_2$ may also be obtained. Thus, for example, when populations are normal with known standard deviations, a $100(1-\alpha)$ per cent upper confidence interval for $\mu_1 - \mu_2$ is

\[ \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, \]

and a $100(1-\alpha)$ per cent lower confidence interval is

\[ (\bar{X}_1 - \bar{X}_2) - z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2. \]

(P.U., B.A./B.Sc. 1986)

This is an example of a one-sided confidence interval with the upper limit as a constant (280 pounds). Therefore the lower limit of the 95% confidence interval estimate for $\mu$ is

\[ \bar{x} - (1.645)\frac{\sigma}{\sqrt{n}} = 120 - (1.645)\frac{40}{\sqrt{100}} = 120 - (1.645)(4) = 120 - 6.58 = 113.42 \text{ pounds}. \]
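A one-sided limit differs from a two-sided one only in the critical value: the whole of $\alpha$ is placed in one tail, so $z_{\alpha}$ replaces $z_{\alpha/2}$. The sketch below shows this with the figures used in the solution above (mean 120, standard deviation 40, sample size 100); the function name `lower_confidence_limit` is hypothetical, and the standard deviation is treated as known.

```python
from math import sqrt
from statistics import NormalDist


def lower_confidence_limit(xbar, sigma, n, conf):
    """One-sided lower confidence limit for mu (sigma treated as known)."""
    z_alpha = NormalDist().inv_cdf(conf)   # all of alpha in one tail
    return xbar - z_alpha * sigma / sqrt(n)


limit = lower_confidence_limit(120, 40, 100, conf=0.95)
print(round(limit, 2))
```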
15.5.7. Sample size for Estimating Population Mean. The $100(1-\alpha)$ per cent confidence interval for $\mu$ is given by

\[ \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

which may be written as

\[ |\bar{X} - \mu| < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \]

where $\frac{\sigma}{\sqrt{n}}$ is the standard error of $\bar{X}$ when sampling is performed with replacement or the population is very large (infinite). The quantity $|\bar{X} - \mu|$ is also called the error of the estimator $\bar{X}$ and is denoted by $e$. Thus a $100(1-\alpha)$ per cent error bound for estimating $\mu$ is given by $z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$. In other words, in order to have $100(1-\alpha)$ per cent confidence that the error in estimating $\mu$ with $\bar{X}$ is less than $e$, we need $n$ such that

\[ e = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]

or $\sqrt{n} = \dfrac{z_{\alpha/2}\,\sigma}{e}$

or $n = \left(\dfrac{z_{\alpha/2}\,\sigma}{e}\right)^2$.

Hence the desired sample size for being $100(1-\alpha)\%$ confident that the error in estimating $\mu$ will be less than $e$, when sampling is with replacement or the population is very large, is given by

\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^2. \]

It is important to note that the population standard deviation $\sigma$ is generally not known; its estimate is found either from past experience or from a pilot sample of size $n > 30$. In case of a fractional result, it is always to be rounded to the next higher integer for the sample size.

The standard error of $\bar{X}$, when sampling is performed without replacement from a small population of size $N$, is given by

\[ \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}. \]

In this case, the $100(1-\alpha)$ per cent error bound for estimating $\mu$ becomes

\[ e = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}. \]

Solving for $n$, we get

\[ n = \frac{N (z_{\alpha/2}\,\sigma)^2}{(N-1)e^2 + (z_{\alpha/2}\,\sigma)^2} \]

as the desired sample size when sampling is without replacement from a small population of size $N$.

Example 15.29. A research worker wishes to estimate the mean of a population using a sample sufficiently large that the probability will be 0.95 that the sample mean will not differ from the true mean by more than 25 per cent of the standard deviation. How large a sample should be taken?

If the sample mean will not differ from the true mean by more than 25% of $\sigma$ with a probability of 0.95, then

\[ e = \frac{25\sigma}{100} = \frac{\sigma}{4} \quad\text{and}\quad z_{\alpha/2} = 1.96. \]

Substituting these values in the formula $n = \left(\dfrac{z_{\alpha/2}\,\sigma}{e}\right)^2$, we get

\[ n = \left(\frac{1.96 \times \sigma}{\sigma/4}\right)^2 = 61.4656. \]

Hence the required sample size should be 62, the next higher integer, as the sample size cannot be fractional.
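Both sample-size formulas for the mean, with and without the finite population adjustment, can be sketched in one function. The name `sample_size_mean` and the population size `N = 1000` in the second call are hypothetical. Since only the ratio $\sigma/e$ matters in the research-worker example, the sketch sets $\sigma = 1$ and $e = 0.25$.

```python
from math import ceil
from statistics import NormalDist


def sample_size_mean(sigma, e, conf, N=None):
    """Sample size so that the error in estimating mu is below e.

    With N=None the population is treated as very large; otherwise the
    formula for sampling without replacement from N units is used.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    if N is None:
        n = (z * sigma / e) ** 2
    else:
        n = N * (z * sigma) ** 2 / ((N - 1) * e ** 2 + (z * sigma) ** 2)
    return ceil(n)   # always round up to the next integer


# e is 25% of sigma, so sigma = 1 and e = 0.25 give the same ratio
n_large = sample_size_mean(sigma=1.0, e=0.25, conf=0.95)
n_finite = sample_size_mean(sigma=1.0, e=0.25, conf=0.95, N=1000)  # N is hypothetical
print(n_large, n_finite)
```

The finite-population size is smaller, as expected, because sampling without replacement from a limited population carries less sampling error per observation.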


15.5.8. Sample size for Estimating Population Proportion. The large sample confidence interval for $p$ is given by

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}. \]

This implies that $e = z_{\alpha/2}\sqrt{\dfrac{pq}{n}}$.

Therefore, solving for $n$, we obtain

\[ n = \left(\frac{z_{\alpha/2}}{e}\right)^2 pq. \]

Since the values of $p$ and $q$ are not known as the sample has not yet been selected, we therefore use an estimate $\hat{p}$ obtained from a pilot sample.

Example 15.30. In a random sample of 75 axle shafts, 12 have a surface finish that is rougher than the specification will allow. How large a sample is required if we want to be 95% confident that the error in using $\hat{p}$ to estimate $p$ is less than 0.05?

Here $e = |\hat{p} - p| = 0.05$, $\hat{p} = \frac{12}{75} = 0.16$, $\hat{q} = 1 - \hat{p} = 0.84$ and $z_{0.025} = 1.96$ ($\alpha/2 = 0.025$).

Substituting these values in the formula $n = \left(\dfrac{z_{\alpha/2}}{e}\right)^2 \hat{p}\hat{q}$, we obtain

\[ n = \left(\frac{1.96}{0.05}\right)^2 (0.16)(0.84) = 207 \]

as the required sample size.

EXERCISES

15.1 (a) Explain what is meant by statistical inference.

(b) Distinguish between
(i) Estimation and Testing of Hypothesis.
(ii) Estimates and Estimators.
(iii) Point estimate and Interval estimate.

15.2 What do you mean by Point estimation? Describe the properties of a good point estimator. (1986, 91, 93, 96)

15.3 (a) Explain with examples the following properties of a point estimator:
(i) Unbiasedness, (ii) Consistency, (iii) Efficiency.

(b) Explain what is meant by the Mean Square Error of an estimator. Prove that $MSE(T) = Var(T) + (Bias)^2$. (P.U., B.A./B.Sc. 1992)

(c) Is an estimator a random variable? Why or why not?

15.4 (a) A sample of size 10 yields values 8, 4, 10, 5, 5, 4, 9, 4, 3. Estimate the mean, the variance and the third moment of the random variable $X$ which models these values.

(b) Find a point estimate of $\mu$ and the estimated standard error in each of the following cases:
(i) $n = 70$, $\sum x_i = 852$, $\sum (x_i - \bar{x})^2 = 215$,
(ii) $n = 160$, $\sum x_i = 1985$, $\sum (x_i - \bar{x})^2 = 475$.

; 115 
IMATION 

NCE: EST! 

RENC 


pea #0 


ii) P [= is a consistent 
t estimator of pt ~ Ha» pei ( a ined under 
nse m if the random sample is. obtaine 
estimator of $? t ae 
estima” lacement. i 
mpling with ToP ; opulation 
sampling W ae sample taken from a pop 
í Xx. Xo iiag 
p If 44 


` that 
variance ©, show t 


i nd 
‘with mean H 8 


, nin + 1)/2 - l í 

Based on a random sample of 3 observations, consider the following possible estimators of μ:

15.5 (a) Let X1, X2, ..., Xn be a random sample from a population with a mean of μ and a variance of σ². Then prove that

s² = Σ(Xi − X̄)² / (n − 1)

is an unbiased estimator of σ². (P.U., B.A/B.Sc. 1995)

(b) If random samples of size n = 2 are drawn with replacement from a population consisting of the six members 10, 12, 14, 16, 18 and 20, show by finding all possible samples that X̄ and s² are unbiased estimators of μ and σ².

15.6 (a) Explain what you understand by
(i) an unbiased estimator, (ii) a consistent estimator, (iii) the relative efficiency of two estimators, and (iv) a sufficient estimator.

(b) If X1, X2 and X3 are a random sample from a normal population with the mean μ and the variance σ², what is the relative efficiency of the estimator T1 = (X1 + 2X2 + X3)/4 with respect to T2 = X̄? (P.U., B.A./B.Sc. 1993)

15.7 (a) Explain why the sample mean X̄, as well as the sample proportion P, is
(i) an unbiased estimator, (ii) a consistent estimator,
(iii) an efficient estimator, (iv) a sufficient estimator.

(b) Taking all possible samples of size 2 with replacement from a finite population 2, B and 2, show that


| B10 (a) 


1 
" 5 ly += X33 
“= + 2 
R= 941 3 3 


+ 0.4 X3 
3= 0.2X + 0.3 X3 ii) the effic ; 
d (i) which are oaia "P.U, B.A/B.Sc. 1980) 
an : ive'to 0 YS , ‘om 8 
i : relative fro 
unbiased estimator re ‘Xg is to be’ taken 


wing 
W) A random sample Xy Xo =» 2, The folo 


ew 6 
variance 
population with mean p} pi 
+ for p: 
i ‘statistics are to be taker 
Ji 


AX, + Xa, 
a k Xit Xat Xa, mo i 6 
a 


aati: 


naai? 
; <œ unbiased? 
~ (i) Which of the above are un {ficient? 
i e i 
(ii) Which of the above.1s m Explain the 
fficiency: OOF. 
What is meant by su 
5 sample 
(b) Let Xi, Xp vn Xn be a random 


ï; 2, ade 
fix; p) = p*q, x = © 
Statistic, ` 


Bie £ (x, - 2%. (P.U., B.A/BSe. 1987) 
2 where S° => oe 
, . 


f A -of pu (ii) t- X, isa 
: istent estimator of pl, 
. ga consis 
at (i) Zi 


i ch 
iency of ea 


Ak 


c Jer the 
y rgriable 
645. 


- integer: 

~e the null 
ants about 
“task. The 


is vue, the 


(c) Let X1, X2, ..., Xn be a random sample from the normal distribution with mean μ and variance σ². Show that X̄ is a sufficient estimator for μ.

15.11 A random sample of n1 observations, taken from a population with unknown mean μ and unknown variance σ², has mean X̄1 and variance S1². A second sample of n2 observations has mean X̄2 and variance S2². Show that an unbiased estimator of the population mean is given by

(n1X̄1 + n2X̄2) / (n1 + n2)

and an unbiased estimator of the population variance σ² is given by

(n1S1² + n2S2²) / (n1 + n2 − 2).


15.12 Two samples of sizes n1 and n2 respectively are drawn from an infinite population having mean μ and variance σ². The two sample means are denoted by X̄1 and X̄2 respectively. Two estimators are defined as follows:

T1 = (n1X̄1 + n2X̄2) / (n1 + n2) and T2 = (X̄1 + X̄2)/2

(a) Show that, as n1, n2 → ∞, Var(T1) → 0 and hence T1 can be thought of as a consistent pooled estimator for μ.

(b) Show that (i) T2 is also an unbiased estimator for μ,
(ii) Var(T2) = (σ²/4)(1/n1 + 1/n2) and hence that T2 is also a consistent estimator for μ.

(c) Show that Var(T1) / Var(T2) = 4n1n2 / (n1 + n2)² and thus, in general, Var(T1) < Var(T2).


15.13 (a) What is the basic criterion of estimation by the method of maximum likelihood?

(b) Define Likelihood Function and describe the mathematical technique of finding maximum likelihood estimator. (P.U., B.A/B.Sc. 1993)

15.14 (a) What do you understand by a maximum likelihood estimator?


(b) If X1, X2, ..., Xn be a random sample of size n taken from the Bernoulli distribution

f(x) = p^x q^(1−x), x = 0, 1; 0 ≤ p ≤ 1,

find the maximum likelihood estimator of p and show that it is a sufficient estimator. (P.U., B.A/B.Sc. 1986-S)


15.15 Consider the binomial distribution

f(x; p) = C(n, x) p^x q^(n−x), x = 0, 1, 2, ..., n;

find the maximum likelihood estimator of p when

(i) a single observation is taken;

(ii) a sample of m observations X1, X2, ..., Xm is taken.
(P.U., B.A/B.Sc. 1992)


15.16 (a) Find the ML estimates of μ and σ² from a sample of n independent observations from a normal distribution.

(b) Find the maximum likelihood estimates for θ1 = μ and θ2 = σ² if a random sample of size 15 from N(μ, σ²) yielded the following values:
31.5 36.9 33.8 30.1 33.9
35.2 29.6 34.4 30.5 34.2
31.6 36.7 35.8 34.5 32.7
15.17 (a) Let X1, X2, ..., Xn be a random sample taken from the negative exponential distribution

f(x) = (1/λ) e^(−x/λ), x > 0.

Find the maximum likelihood estimate for λ.
(b) Let X1, X2, ..., Xn be a random sample from a normal population N(0, σ²). Find the maximum likelihood estimator (MLE) of σ². (P.U., B.A/B.Sc. 1988)

15.18 If x1, x2, ..., xn are the values of a random sample of size n from a Poisson population with parameter λ, find an estimator of λ using
(i) the method of moments,
(ii) the method of maximum likelihood.
15.19 (a) Explain what is meant by (i) Confidence Interval, (ii) Confidence Limits, and (iii) Confidence Coefficient.

(b) Describe the procedure followed in establishing an interval estimate for a population mean.


(c) We know that the statement P(X̄ − 1.96σ_X̄ < μ < X̄ + 1.96σ_X̄) = 0.95 is correct, while the statement P(140 < μ < 160) = 0.95 is not. Explain why the latter is erroneous.


T Explair ‘ue meaning of the following terms: 


£, Random Interval, (ii) Confi l 
nee 5 : d 7 i 
(ii) Interval Estimation. S 
(b) Let the observed value of the mean X ofa random 
. as 5 
sits n=20.from a normal distribution with a p 
varia =, i 5 4 
ariance =80 be 81.2, Find a 95% confidence interval fi eg 
or jl. 


(c) Explain what i 
\ in what is meant hy the statement. “We 
R confident that our interval sstimate coi: T u” Ta 
.21 (a) How will ` i ) ee 
will you determine co.fidence interval . 
normal distribution? . - oe Sens 
(b) Find a.95 | 
‘95 per cent confidence in i 
5 tervai for LL, the tr 
4, ruo mean of 


`a normal population whi 
on , ‘hich has o = ; 
„size n=25 with a mean of 61.63 o = 10. Consider a sample of 


15.22 A school w; 
MS ishes to estimat 
dah gaits, a ate the average weight of st i 
a a ape ae se of size. n=25 is a a . i o 
ie peii 0 o be X=100 Ibs. The sta Pe 
e mo a ndard deviation of 
confidence inter o; ne Ts, ipi : 
a Fagan for the population fe gg the 90% 
ak Weights to be normal (P ai oe - 
ü nie ae a Wy D, ./B Sc 1991 
mean by Confi Sato l 
confidante i y Confidence Interval? Fi 
sampic is ma for mean of-a normal ria ro ie 
"bution when 


` (b) Find a 90 . 
% confidence i 

distribution wi ice interval for the m TER 
—0.9) n with O=3, given the sample pie of a normal 
( 3, —0.2, —0.4, 
(a) As confidence interval i< < (P.U., B.A/B.Sc. 1976, 87, 92) 

of size n=50, for i ìs constructed, from a rand que 
Which has o=21 t ne wacan yield of a an = — 
and 875.89 tons aa The limits for the inter popu ation 
. What confidepze nb emia are 866.11 
f t was used? 


(b) Th 7 
e 95% confidence inte 
particular brand jn 
iaterval is based 


15.24 


r the m 
of | ; ean length i 
ight bulbs is (1023.3 h, AT 


` bulbs, Fi on results fr 
. Find t} rom a ra 

he 99% confidence a era align 

f ` the mean length 


of life of thi 
ising Is br E 
life i and cfl 
Ife is normally ent a bulbs, assuming that the length of 


‘ oe 


f 
i 


15.25 (a) Explain the concept of Interval Estimation.

(b) A random sample of size n=50 from a normal population yielded the sample values x̄ = 190 and S² = 800. Find a 95% confidence interval for μ.

(c) A random sample of 100 values was taken from a normal population with mean μ and the following results were obtained: ΣX = 3978.7 and ΣX² = 1583098.3. Find 98% confidence interval for μ. (P.U., B.A/B.Sc. 1996)

15.26 (a) A restaurant wishes to estimate the average amount of money a customer spends for lunch. A random sample of size n=36 is selected and the sample mean is found to be x̄ = Rs. 35.00. Assuming σ = Rs. 13.77, find 95% confidence limits for μ.


ndom sample of n=10 
t replacemen 


(b) A ra 0 widths for two-by-fours was 
selected withou t from*a shipment of N=500 
poards. The results ‘show. that x 


‘inches. Construct a 96% confidence inter 

boards in the entire shipment. ` 

15.27 In July 1969, the first man walked on the moon. Armstrong, Aldrin and Collins brought back 64 rock samples. The rocks had an average Earth weight of 172 ounces. The sample variance S² was 299 (ounces)². The moon rock population is known, however, to follow a distribution which is not normal. Find a 99% confidence interval estimate for the mean weight of rocks on the lunar surface.


ormal population with mean ft 
and LX = gy2 = 973.44. A 
ame population gave 


15.28.” A sample of 64 readings from a n 
5452.8 


Vand variance o gave LX = 
second sample of readings from thes 


a ; i 2 and give 
< Combine the two samples to 8! ies 
‘she approximate 97% confidence inte 

529 (a) Discuss the problem of finding 4 confide 


difference (HTH?) the two 
ormal distribution 
y equal. 


rval for p- 


nce interval for the 
of two 


means 


between 
s if the variance 


independent n 


are known but not necessaiil 


Di 1 
s o} and Sy 


= 3.5 inches and S=0.1% 25 TE 


a YNA 


va 


(b) Let two independent random samples, each of size 100, from two independent normal distributions N(μ1, σ1²) and N(μ2, σ2²) yield x̄ = 4.8, S1² = 8.64, ȳ = 5.6, S2² = 7.88. Find a 95% confidence interval for (μ1 − μ2).
(P.U., B.A/B.Sc. Hons. Part III, 1967)


15.30 (a) If X̄1, X̄2 are the sample means of two independent samples of sizes n1, n2 from normal populations with known variances σ1², σ2² respectively, show that the 98% confidence limits of the difference between population means are

(X̄1 − X̄2) ± 2.33 √(σ1²/n1 + σ2²/n2). (P.U., B.A/B.Sc. 1993)

(b) How will these limits be affected, if
(i) the populations are not known to be normal;
(ii) the variances are not known;
(iii) the confidence co-efficient is 95%.

(a) Explain the difference between a point estimate and an 
interval estimate, Why isan interval estimate more useful? 
A manufacturing company. consists of two de 
Producing identical products, It is suspected that the hourly ; 


outputs in the two departments are different. Two random 
Samples of production hours are respectively selected, and 
the following data are obtained: i _. = 


15.31 


partments 


Department 1 


Sample sizo 


Sample mean 100 Ty 


The variances of the hourly outputs for the two departments 


= 256 and o, = 196 respectively, What is 


confidence limits for the true difference., 


In order to ascertain the age 
certain. industry, random sam 
females are drawn. The sample 
were 33.93 years and 14.2 


(P.U., B.A/B.Sc, 1986-S) 
distribution of operatives in a 
ples of 1720 males and 1230 
means and standard deviations 
S for the males, and 27.44 years 


15.32 


20. year 


O 


i 


: INFERENCE: ESTIMATION, 4 os, l 
aristicAL : 79 years for the females. Calculate the 95% confidence 
f + e estimates for i i 

inte mean age of all male operatives, - 

(a) the ean age ot all female operatives, and 

i ui rences between their mean ages. (B.Z.U., B.A/B.Sc. 1985) 
ues sai variances of the weekly incomes in rupees of the 


15.33 piian employed in the different factories, from the samples are 
` wor ers. RES P 
` given below: 


160 64 


B. 220 41 = 

(a) What is the maximum likelihood estimate of the sila in 
a. . . 

mean incomes? 


(b) Compute the 90 per ce 
J real differences in the 
factories. 


nt confidence interval estimate for the 
incomes of the workers from the two 


= 100 yielded the: sample value 


1534 A random sample of size ny =100 from 


x S? = 950. A random sample of size ny 
x1 = 509, aed ` i e a g = 876. Find a 95 ne 
another population yielded 2 = 447 and Sy 


` cent confidence interval for Hy ~ Ha: independent Bernoullian 
~ z ion in n in 7 
1535 (a) If p is the observed proportion © e limits for a population 


trials, show that the 95% peii sane i 
proportion p are, for large samptes, 


p +£1.96 Nz a 
labour force in a 
| bers of the astruct 
: ' le of 400 member: fovea. Cos , 
wh pegia = oa that 32 were sine AA unemployed in 
e perra interval for the propo 
e o CO : ; 
the region. 
-(c) A random sample of 75 co 


| $: n campus. 
found to have cars on N anain have cars 0 
n acinos 
- to estimate the fraction 


llege students 1S selected and 16 are 


(P.U B.A./B.Sc. 1990) 


4 : : the 
interval for 
a 95% confidence keper 


ii (a) Given X=60, construct distribution for w 


. % ] 
parameter p of a binomia 


e a 95% confidence interval . 
s 


122 S INTRODUCTION TO STATISTICAL THEORY 


(b) In a random sample of 1000 homes in a certain city, it is found that 628 are heated by natural gas. Find 98% confidence interval for the fraction of homes in this city that are heated by natural gas.

(c) A random sample of size n=144 gave p̂ = 0.76. Construct a 90% confidence interval for p. Interpret the 90% confidence interval.

15.37 (a) Determine a method for constructing a confidence interval for p1 − p2, the difference of two population proportions.


(b) Find a 95% confidence interval {gr BPs if a sample of size 
nı=100 yielded Py = 0.545fn a eH bs of size ny=100 
yielded py = 0.49, OSH 5 tndiverk 

1548 A poll is-taken among residents oF a city and its suburbs to 
- determine the feasibility of a proposal to construct a civic.centre, 
„I£ 2400 of 5000 city residents favour the Proposal and 1200 of 
- 2000 suburban residents favour it, find a 90% confidence interval 


15.39 (a) A random sample of 65 bolts from a 


inches, Construct a 99% one-sided confidence interval for the 

maximum mean diameter of the population, ` 
(b) The life in hours of a 75-watt light bulb is known to be 
ly distributed with standard deviation 


ulbs has a mean life of 
l lower confidence interval 
on the mean life. ` ` 

ate the mean of a normal population 


how large a sample should y, 

is 10, } you take so 
that the Probability is 0.80 that your estimate will not be in 
` error by more than 0.4 units? ` i l 


Statistical Inference: Hypothesis Testing

16.1 INTRODUCTION

Hypothesis testing is an important phase of statistical inference. It is a procedure which enables us to decide, on the basis of information obtained from sample data, whether to accept or reject a statement or an assumption about the value of a population parameter. Such a statement or assumption, which may or may not be true, is called a statistical hypothesis. We accept the hypothesis as being true when it is supported by the sample data. We reject the hypothesis when the sample data fail to support it.

It is important to understand what we mean by the terms reject and accept in hypothesis testing. The rejection of a hypothesis is to declare it false. The acceptance of a hypothesis is to conclude that there is not sufficient evidence to reject it. Acceptance does not necessarily mean that the hypothesis is true.

The basic concepts associated with hypothesis testing are discussed below:

16.1.1. Null and Alternative Hypothesis. A null hypothesis, generally denoted by the symbol H0, is any hypothesis which is to be tested for possible rejection under the assumption that it is true. Usually H0 is the hypothesis of "no effect". A null hypothesis should be precise, such as "the given coin is unbiased" or "a drug is ineffective in curing a particular disease". The hypothesis is usually assigned a numerical value. For example, suppose we think that the average height of students in all colleges is 62". This statement is taken as a hypothesis and is written symbolically as H0: μ = 62". In other words, we hypothesize that μ = 62".

An alternative hypothesis is any other hypothesis which we accept when the null hypothesis H0 is rejected. It is customarily denoted by H1 or HA. A null hypothesis H0 is thus tested against an alternative hypothesis H1. For example, if our null hypothesis is H0: μ = 62", then our alternative hypothesis may be H1: μ ≠ 62" or H1: μ > 62" or H1: μ < 62".
16.1.2. Simple and Composite Hypotheses. A simple hypothesis is one in which all parameters of the distribution are specified. For example, if the heights of college students are normally distributed with σ² = 4, and we hypothesize that the mean μ is, say, 62", i.e. H0: μ = 62", we have stated a simple hypothesis, as the mean and variance together specify a normal distribution completely. A simple hypothesis, in general, states that θ = θ0 where θ0 is the specified value of a parameter θ (θ may represent μ, p, μ1 − μ2, etc.).

A hypothesis which is not simple (i.e. in which not all of the parameters are specified) is called a composite hypothesis. For instance, if we hypothesize that H: μ > 62" (and σ² = 4) or H: μ = 62" and σ² < 4, the hypothesis becomes a composite hypothesis because we cannot know the exact distribution of the population in either case. Obviously, the parameters μ > 62" and σ² < 4 have more than one value and no specified values are being assigned. The general form of a composite hypothesis is given by H: θ ≤ θ0 or θ ≥ θ0, that is, the parameter θ does not exceed or does not fall short of a specified value θ0. The concept of simple and composite hypotheses applies to both the null hypothesis and the alternative hypothesis.

A hypothesis is called an exact hypothesis when it specifies a unique value for the parameter, such as H: μ = 62 or p = 0.5. A hypothesis is called an inexact hypothesis when it indicates more than one possible value for the parameter, such as H: μ ≠ 62 or H: p > 0.5. A simple hypothesis must be exact, while an exact hypothesis is not necessarily a simple hypothesis. An inexact hypothesis is a composite hypothesis.


16.1.3. Test-Statistic. A statistic which provides a basis for testing a null hypothesis is called a test-statistic. Every test-statistic has a probability (sampling) distribution under the null hypothesis. The sampling distributions of the most commonly used test-statistics are normal, t, chi-square or F.

16.1.4. Acceptance and Rejection Regions. All possible values which a test-statistic may assume can be divided into two mutually exclusive groups: one group consisting of values which appear to be consistent with the null hypothesis, and the other having values which are unlikely to occur if H0 is true. The first group is called the acceptance region and the second set of values is known as the rejection region for a test. The rejection region is also called the critical region. The value(s) that separates the critical region from the acceptance region is called the critical value(s). The critical value, which can be in the same units as the parameter or in the standardized units, is to be decided by the experimenter keeping in view the degree of confidence he (she) is willing to have in the null hypothesis.

16.1.5. Type I and Type II Errors. When we perform a hypothesis test, we derive the evidence from the sample in the form of a test-statistic. There is a possibility that the sample evidence may lead us to make a wrong decision. We may reject a null hypothesis H0 when it is, in fact, true, or we may accept a null hypothesis H0 when it is actually false. The former type is called an error of the first kind or a Type I error, while the latter, an error of the second kind or a Type II error. The decision and the corresponding two types of error may be displayed in a tabular form as below:

                               Decision
True Situation     Accept H0              Reject H0 (or accept H1)
H0 is true         Correct decision       Wrong decision
                   (No error)             (Type I error)
H0 is false        Wrong decision         Correct decision
                   (Type II error)        (No error)

A legal analogy will help in understanding the difference between Type I and Type II errors. In a court trial, the supposition is that the accused (the defendant) is innocent. This supposition of innocence may be regarded as a kind of null hypothesis H0 that is to be rejected or accepted. After having heard the evidence presented during the trial, the judge arrives at a decision. Suppose the accused is, in fact, innocent (i.e. H0 is true), but the finding of the judge is guilty. The judge has rejected a true null hypothesis and in so doing, has made a Type I error. If, on the other hand, the accused is, in fact, guilty (i.e. H0 is false) and the finding of the judge is innocent, the judge has accepted a false null hypothesis and by accepting a false hypothesis, he has committed a Type II error.

The probability of making a Type I error is conventionally denoted by α (alpha) and that of committing a Type II error is indicated by β (beta). Thus α is the probability of rejecting H0 when H0 is true and β is the probability of accepting H0 when H0 is false (i.e. H1 is true). In symbols, we may write

α = P(Type I error) = P(reject H0 / H0 is true),

β = P(Type II error) = P(accept H0 / H0 is false).

Let us consider two distributions: one under the null hypothesis H0: μ = μ0 (i.e. distribution assuming H0 is true) and the other under the alternative hypothesis H1: μ = μ1 (i.e. distribution assuming H1 is true).

[Figure: distributions under H0: μ = μ0 and H1: μ = μ1, showing the acceptance region and the rejection region.]

The probabilities α and β are the shaded and dotted areas respectively of the distributions under the null and the alternative hypotheses. When the null hypothesis is rejected for values greater than or equal to C (the critical value), β is the area to the left of C under the distribution based on the alternative hypothesis H1. To compute α (the probability of Type I error) and β (the probability of Type II error) we require the specified values of both μ0 and μ1. When α becomes smaller, β becomes larger, and when α becomes larger, β becomes smaller. Thus there is an inverse relationship between α and β. We can, however, decrease both α and β by increasing the sample size.
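The inverse relationship between α and β, and the effect of the sample size on both, can be seen numerically. The sketch below assumes an upper-tailed z-test with known σ; the function name and the values μ0 = 50, μ1 = 52 and σ = 5 are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def beta_upper_tailed(mu0, mu1, sigma, n, alpha):
    """P(accept H0 | mu = mu1) for an upper-tailed z-test of H0: mu = mu0."""
    se = sigma / sqrt(n)
    # Critical value C: H0 is rejected when the sample mean is at least C.
    critical = mu0 + STD_NORMAL.inv_cdf(1.0 - alpha) * se
    # Beta is the area to the left of C under the H1 distribution.
    return STD_NORMAL.cdf((critical - mu1) / se)


# Hypothetical values, not taken from the text.
mu0, mu1, sigma = 50.0, 52.0, 5.0

beta_a05_n25 = beta_upper_tailed(mu0, mu1, sigma, n=25, alpha=0.05)
beta_a01_n25 = beta_upper_tailed(mu0, mu1, sigma, n=25, alpha=0.01)
beta_a05_n100 = beta_upper_tailed(mu0, mu1, sigma, n=100, alpha=0.05)
beta_a01_n100 = beta_upper_tailed(mu0, mu1, sigma, n=100, alpha=0.01)

print("n=25 :", round(beta_a05_n25, 4), round(beta_a01_n25, 4))
print("n=100:", round(beta_a05_n100, 4), round(beta_a01_n100, 4))
```

With n fixed, lowering α from 0.05 to 0.01 raises β; with n raised from 25 to 100, the test with α = 0.01 has a smaller β than the original test with α = 0.05, so both error probabilities are reduced together.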


Example 16.1. The proportion of adults living in a small town who are matriculates is estimated to be p = 0.3. To test this hypothesis a random sample of 15 adults is selected. If the number of matriculates in the sample is anywhere from 2 to 7, we shall accept the null hypothesis that p = 0.3; otherwise we shall conclude that p ≠ 0.3. Evaluate α assuming p = 0.3. Evaluate β for the alternatives p = 0.2 and p = 0.4.
(P.U., B.A/B.Sc. 1986)

The null and alternative hypotheses are given as

H0: p = 0.3 and H1: p ≠ 0.3.

Let X denote the number of adults who are matriculates. Then the test-statistic has the binomial distribution with p = 0.3 and n = 15. The acceptance region, as given, consists of all values from X = 2 to X = 7. Then the critical region is composed of two parts: all values less than 2 and all values greater than 7. Thus the probability of making a Type I error, i.e. α, consists of P(X < 2) and P(X > 7).

Hence α = P(X < 2 when p = 0.3) + P(X > 7 when p = 0.3)

= Σ_{x=0}^{1} b(x; 15, 0.3) + Σ_{x=8}^{15} b(x; 15, 0.3)

= Σ_{x=0}^{1} b(x; 15, 0.3) + [1 − Σ_{x=0}^{7} b(x; 15, 0.3)]

= 0.0353 + [1 − 0.9500] (From Binomial probability tables)

= 0.0853

To compute β, the probability of a Type II error, we need a specific alternative hypothesis. Now, we are given H0: p = 0.3 and H1: p = 0.2. A Type II error results when a false null hypothesis is accepted. Thus a Type II error occurs if any value of the distribution under H1: p = 0.2 falls in the region X = 2 to X = 7, the acceptance region of the distribution under the null hypothesis H0: p = 0.3.

Hence β = P(2 ≤ X ≤ 7 when H1: p = 0.2)

= Σ_{x=2}^{7} b(x; 15, 0.2)

= Σ_{x=0}^{7} b(x; 15, 0.2) − Σ_{x=0}^{1} b(x; 15, 0.2)

= 0.9958 − 0.1671 (From Binomial probability tables)

= 0.8287

Similarly, when H1: p = 0.4, we have

β = P(2 ≤ X ≤ 7 when p = 0.4)

= Σ_{x=2}^{7} b(x; 15, 0.4)

= Σ_{x=0}^{7} b(x; 15, 0.4) − Σ_{x=0}^{1} b(x; 15, 0.4)

= 0.7869 − 0.0052 = 0.7817
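The table look-ups in Example 16.1 can be reproduced directly from the binomial probability function. This is a minimal sketch; the helper names `binom_pmf` and `prob_accept` are illustrative and not part of the text.

```python
from math import comb


def binom_pmf(x, n, p):
    """b(x; n, p): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def prob_accept(n, p, low, high):
    """P(low <= X <= high) when X has the binomial distribution b(x; n, p)."""
    return sum(binom_pmf(x, n, p) for x in range(low, high + 1))


n, low, high = 15, 2, 7  # acceptance region X = 2, ..., 7

alpha = 1 - prob_accept(n, 0.3, low, high)   # reject H0 although p = 0.3
beta_02 = prob_accept(n, 0.2, low, high)     # accept H0 although p = 0.2
beta_04 = prob_accept(n, 0.4, low, high)     # accept H0 although p = 0.4

print(round(alpha, 4), round(beta_02, 4), round(beta_04, 4))
```

The exact values may differ from the worked answers in the fourth decimal place, because the example subtracts table entries that are already rounded to four places.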
16.1.6. The Power of a Test with respect to a specified alternative hypothesis is the probability of rejecting a null hypothesis when it is actually false. The power is the complement of β, the probability of committing a Type II error. It is therefore numerically equivalent to one minus β. Symbolically,

Power = P(reject H0 / H0 is false)
= 1 − β

To represent α, β and power of a test graphically, we show the distributions of the test-statistic under both hypotheses H0 and H1 as below:

[Figure: distribution of the test-statistic under H0 (upper diagram) and under H1 (lower diagram), each marked with the rejection region (Reject H0) and the acceptance region (Accept H0).]

The shaded area in the lower diagram represents power. This probability corresponds to the rejection region of the distribution under H0. The power generally increases with an increase in the sample size. A test for which β is small is defined to be a powerful test.

A curve giving the probabilities of making Type II errors for various parametric values under alternative hypotheses is called an Operating Characteristic Curve or simply the OC curve. The Power curve, which may be regarded as the complement of the OC curve, shows the probabilities of rejecting the null hypothesis H0 for various values of the parameter θ.
16.1.7. The Significance Level of a test is the probability used as a standard for rejecting a null hypothesis H0 when H0 is assumed to be true. This probability is equal to some small pre-assigned value, conventionally denoted by α. The value α is also known as the size of the critical region. It is noteworthy that the level of significance and the probability of Type I error are equivalent. The most frequently used values of α, the significance level, are 0.05 and 0.01, i.e. 5 percent and 1 percent, but occasionally 0.10 or 0.001 is used. By α = 5%, we mean that there are about 5 chances in 100 of incorrectly rejecting a true null hypothesis. To put it in another way, we say that we are 95% confident in making the correct decision.

16.1.8. Test of Significance. A test of significance is a rule or procedure by which sample results are used to decide whether to accept or reject a null hypothesis. Such a procedure is usually based on a test-statistic and the sampling distribution of such a statistic under H0. A value of the statistic is said to be statistically significant when the probability of its occurrence under H0 is equal to or less than the significance level α, that is, the value falls in the rejection region. H0 in this case is rejected. If, on the other hand, the value falls in the acceptance region, it is said to be statistically insignificant. In this case, H0 may be accepted. There are two desirable qualities for a test of significance. First, when the null hypothesis H0 is actually true, it must have a low probability of rejecting H0, and secondly, when H0 is actually false, it must have a high probability of rejecting H0. It is to be noted that the word significant is used in a special sense.

16.1.9. One-tailed and Two-tailed Tests. A test for which the entire rejection region is located in only one of the two tails, either in the right tail or in the left tail, of the sampling distribution of the test-statistic is called a One-tailed test or One-sided test. For example, if Z is a test-statistic, then the rejection region consists either of all z-values which are greater than +z_α or of all z-values which are less than −z_α, where α is the size of the critical region. A one-tailed test is used when the alternative hypothesis H1 is formulated in the following form:

H1: θ > θ0 or
H1: θ < θ0, i.e. H1 is a composite hypothesis.

If, on the other hand, the rejection region is divided equally between the two tails of the sampling distribution of the test-statistic, the test is called a Two-tailed test or Two-sided test. In this case, the alternative hypothesis is formulated as

H1: θ ≠ θ0 (i.e. H1 is a two-sided composite hypothesis)

meaning thereby that values both larger and smaller than θ0 are to be included. In the case of the normal distribution, the rejection (critical) regions are indicated by shading the appropriate portions of area under the sampling distribution in the figures.

[Figures: Reject H0 if z < −z_α (left-tailed test); Reject H0 if z > z_α (right-tailed test); Reject H0 if z < −z_{α/2} or z > z_{α/2} (two-tailed test).]

The location of the critical region can be determined only after the alternative hypothesis has been stated. It is important to note that the one-tailed and the two-tailed tests differ only in the location of the critical region, not in the size.

Example 16.2. It is hypothesized that the mean weight of a population of people is 140 lb. Using σ = 15 lb, α = 0.05 and a sample of 36 people, find

(a) the values of X̄ which would lead to rejection of the hypothesis,

(b) β, the probability of Type II error, if μ = 150 lb. Use a two-sided test. (P.U., M.Sc., 1970, 86)

We are given the following information:

H0: μ = 140 lb, σ = 15 lb, α = 0.05 and n = 36.

(a) To find the values of X̄ (the critical points) which would lead to rejection of the hypothesis H0: μ = 140 lb, we use the test statistic (assuming normal population) given by

Z = (X̄ − μ0) / (σ/√n).

Since the test is two-sided, there would be two critical values. Corresponding to the significance level α = 0.05, the critical values of Z from the table of normal curve are −1.96 and 1.96. Thus

±1.96 = (X̄ − 140) / (15/√36).

Simplifying, we get X̄ = 135.1 and 144.9 as the two critical values. Hence the hypothesis H0: μ = 140 lb will be rejected if x̄ < 135.1 lb or x̄ > 144.9 lb.

(b) A Type II error can be committed only by accepting a false H0. The hypothesis H0: μ = 140 lb will be false if μ takes a value other than 140 lb. We are given H1: μ = 150 lb, so that H0: μ = 140 lb becomes false. Therefore, the probability of accepting H0: μ = 140 lb (false) when H1: μ = 150 lb, i.e. the probability of a Type II error, is indicated by the dotted area β in the figure shown below. To compute this area, we use the distribution under the alternative hypothesis H1: μ = 150 lb.

[Figure: distributions under H0: μ = 140 and H1: μ = 150, with critical values 135.1 and 144.9; Reject H0 below 135.1, Accept H0 between 135.1 and 144.9, Reject H0 above 144.9; area α/2 in each tail of the H0 distribution and dotted area β under the H1 distribution.]

Now, at x̄ = 144.9, we find z = (144.9 − 150) / (15/√36) = −2.04, and

at x̄ = 135.1, we find z = (135.1 − 150) / (15/√36) = −5.96.

Thus β = Area between z = −5.96 and z = −2.04, i.e. area in the acceptance region of the distribution under H0, when H1 is μ = 150,

= 0.0207

This is the probability of accepting the null hypothesis H0: μ = 140 lb when, in fact, the alternative hypothesis H1: μ = 150 lb is true.
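The two steps of Example 16.2 can be checked numerically. The sketch below uses the values given in the example; the variable names are illustrative, and the exact normal quantile is used in place of the rounded table value 1.96.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1 = 140.0, 150.0      # H0 and H1 means (lb)
sigma, n, alpha = 15.0, 36, 0.05

se = sigma / sqrt(n)                           # standard error of the mean, 2.5
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

# (a) critical values of the sample mean for the two-sided test
lower = mu0 - z_crit * se
upper = mu0 + z_crit * se

# (b) beta: area of the H1 distribution lying inside the acceptance region
alt = NormalDist(mu=mu1, sigma=se)
beta = alt.cdf(upper) - alt.cdf(lower)

print(round(lower, 1), round(upper, 1), round(beta, 4))
```

The area below z = −5.96 is negligible, so β is effectively the area to the left of z = −2.04 under the H1 distribution.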


... $\alpha = 0.05$, $n = 4$ and $\sigma^2 = 15$, i.e.

$$C = 30 + (1.645)\sqrt{15/4} = 30 + (1.645)\sqrt{3.75} = 30 + (1.645)(1.94) = 33.19.$$

Since 4 values of $\mu$ in the alternative hypothesis are specified, we associate the variables $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$ and $\bar{X}_4$ with each of the four alternative ($H_1$) distributions.

Using the alternative ($H_1$) distribution with $\mu = 31$, we calculate the value of $\beta$ (say $\beta_{\mu=31}$) as

$$\beta_{\mu=31} = P(\text{Type-II error} \mid \mu = 31) = P(\bar{X}_1 < 33.19) = P\left(Z < \frac{33.19 - 31}{1.94}\right) = P(Z < 1.13) = 0.8708.$$

Again using the $H_1$-distribution with $\mu = 32$,

$$\beta_{\mu=32} = P(\text{Type-II error} \mid \mu = 32) = P(\bar{X}_2 < 33.19) = P\left(Z < \frac{33.19 - 32}{1.94}\right) = P(Z < 0.61) = 0.7291.$$

Similarly,

$$\beta_{\mu=34} = P(\text{Type-II error} \mid \mu = 34) = P(\bar{X}_3 < 33.19) = P\left(Z < \frac{33.19 - 34}{1.94}\right) = P(Z < -0.42) = 0.3372,$$

$$\beta_{\mu=36} = P(\text{Type-II error} \mid \mu = 36) = P(\bar{X}_4 < 33.19) = P\left(Z < \frac{33.19 - 36}{1.94}\right) = P(Z < -1.45) = 0.0735.$$

The power of the test for $\mu = \mu_1$, say $P_w(\mu_1)$, is given by $1 - \beta_{\mu=\mu_1}$. Hence the required powers are:

$P_w(31) = 1 - \beta_{\mu=31} = 1 - 0.8708 = 0.1292$,
$P_w(32) = 1 - \beta_{\mu=32} = 1 - 0.7291 = 0.2709$,
$P_w(34) = 1 - \beta_{\mu=34} = 1 - 0.3372 = 0.6628$, and
$P_w(36) = 1 - \beta_{\mu=36} = 1 - 0.0735 = 0.9265$.

The sketch of the power curve for the test is shown below:

[Figure: power curve of the test, with the power $P_w(\mu)$ plotted against $\mu$ = 31, 32, 34 and 36.]
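The four powers can be reproduced with a few lines of code. This is a sketch under stated assumptions: the null value $\mu_0 = 30$ is read off from the expression for $C$ above, the test is taken to be one-sided (upper tail), and the function name `power` is hypothetical.

```python
from statistics import NormalDist

# Assumed inputs: mu0 = 30 is inferred from "C = 30 + ..." in the text.
mu0, sigma2, n, alpha = 30.0, 15.0, 4, 0.05

se = (sigma2 / n) ** 0.5                          # sqrt(3.75), about 1.94
c = mu0 + NormalDist().inv_cdf(1 - alpha) * se    # critical value, about 33.19

def power(mu1):
    """Power of the upper-tailed test when the true mean is mu1."""
    beta = NormalDist(mu=mu1, sigma=se).cdf(c)    # P(sample mean < c | mu1)
    return 1 - beta

for mu1 in (31, 32, 34, 36):
    print(mu1, round(power(mu1), 4))
```

The printed values agree with the hand computation to about two decimal places; the remaining differences come from rounding $z$ and the standard error in the tables.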


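16.1.10. Sample size when $\alpha$ and $\beta$ are specified. We may determine the sample size needed to discriminate between the two hypotheses when the sizes of both types of error are specified. Let the null hypothesis be $H_0: \mu = \mu_0$ and the alternative hypothesis be $H_1: \mu = \mu_1$, where $\mu_1 > \mu_0$. There are then two sampling distributions from two populations $N(\mu_0, \sigma_0^2)$ and $N(\mu_1, \sigma_1^2)$.

[Figure: the sampling distributions under $H_0: \mu = \mu_0$ and $H_1: \mu = \mu_1$, with the area $\alpha$ to the right of the critical point $\bar{x}$ under the $H_0$ curve and the area $\beta$ to the left of $\bar{x}$ under the $H_1$ curve.]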


The rejection region under the $H_0$-distribution is the area $\alpha$ (one-tailed test) and it lies to the right of the point $\bar{x}$. The type II error under the $H_1$-distribution is represented by the area $\beta$ and it lies to the left of the point $\bar{x}$, i.e. it is associated with the area under the $H_1$-distribution in the acceptance region established from the $H_0$-distribution. We see that every value to the right of $\bar{x}$, i.e. falling in the critical region, calls for the rejection of the null hypothesis $H_0$, and every value to the left of $\bar{x}$, i.e. falling in the acceptance region, calls for the acceptance of the null hypothesis $H_0$. When $H_0$ is true, the upper limit of the acceptance region (the point $\bar{x}$) is determined by the expression

$$\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}} = z_0 \quad \text{or} \quad \bar{x} = \mu_0 + z_0 \frac{\sigma_0}{\sqrt{n}},$$

where $z_0$ is the normal deviate corresponding to the lower limit of the rejection region. The point under the alternative hypothesis $H_1$ that cuts off an area equal to the specified magnitude of $\beta$ is this same upper limit of the acceptance region under $H_0$.

Considering $H_1$, we again determine the upper limit of the acceptance region by

$$\frac{\bar{x} - \mu_1}{\sigma_1/\sqrt{n}} = -z_1 \quad \text{(minus sign as the critical point } \bar{x} \text{ lies to the left of } \mu_1\text{)}$$

or

$$\bar{x} = \mu_1 - z_1 \frac{\sigma_1}{\sqrt{n}},$$

where $z_1$ is the normal deviate corresponding to the upper limit of the acceptance region.

Hence the upper limit of the acceptance region can be represented as

$$\bar{x} = \mu_0 + z_0 \frac{\sigma_0}{\sqrt{n}} \quad \text{and} \quad \bar{x} = \mu_1 - z_1 \frac{\sigma_1}{\sqrt{n}}.$$

Equating these two values of $\bar{x}$, we get

$$\mu_0 + z_0 \frac{\sigma_0}{\sqrt{n}} = \mu_1 - z_1 \frac{\sigma_1}{\sqrt{n}}.$$

Solving for $n$, we obtain

$$n = \frac{(\sigma_0 z_0 + \sigma_1 z_1)^2}{(\mu_1 - \mu_0)^2} = \frac{\sigma^2 (z_0 + z_1)^2}{(\mu_1 - \mu_0)^2}, \quad \text{if } \sigma_0 = \sigma_1 = \sigma.$$


The required sample size for a two-tailed test can be determined in a similar way.

Example 16.4. A firm wishes to test the hypothesis that the 
average monthly wage of its shop employees is Rs. 180 with a standard 
deviation of Rs. 4.50. They wish to run a risk of only 0.05 of rejecting the 
null hypothesis when it is true, and a risk of 0.05 of accepting the null 
hypothesis if the average is as high as Rs. 182 or as low as Rs. 178. How 
large a sample should be taken? (P.U., M.Sc. 1986) 

A risk of only 0.05 of rejecting the null hypothesis when it is true implies that the probability of a type I error is $\alpha = 0.05$, and a risk of 0.05 of accepting the null hypothesis if the average is as high as Rs. 182 or as low as Rs. 178 implies that the probability of a type II error is $\beta = 0.05$. The distributions of average monthly wage for the alternative values of $\mu$ and the acceptance and rejection regions at the specified levels of $\alpha$ and $\beta$ are represented by normal distributions in the following figure:


[Figure: three normal curves for $H_1: \mu = 178$, $H_0: \mu = 180$ and $H_1: \mu = 182$, with the acceptance region in the centre and a rejection region on either side.]


The value of the variable Z for a two-tailed test at $\alpha = 0.05$ under the $H_0$-distribution from area tables is $z_0 = 1.96$, and the value of the variable Z at $\beta = 0.05$ under the $H_1$-distribution from area tables is $z_1 = 1.645$.


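Substituting these values in the formula

$$n = \frac{(z_0 + z_1)^2 \sigma^2}{(\mu_1 - \mu_0)^2},$$

we get

$$n = \frac{(1.96 + 1.645)^2 (4.5)^2}{(182 - 180)^2} = \frac{(3.605)^2 (4.5)^2}{(2)^2} = 65.79.$$

Hence the sample size must be at least 66, the next higher integer.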
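The sample-size formula with a common $\sigma$ lends itself to a short numerical check. The sketch below assumes the figures of Example 16.4; the function name `sample_size` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(mu0, mu1, sigma, alpha, beta, two_tailed=True):
    """n = sigma^2 (z0 + z1)^2 / (mu1 - mu0)^2, assuming a common sigma."""
    tail = alpha / 2 if two_tailed else alpha
    z0 = NormalDist().inv_cdf(1 - tail)   # deviate for the type I error
    z1 = NormalDist().inv_cdf(1 - beta)   # deviate for the type II error
    n = (sigma * (z0 + z1) / (mu1 - mu0)) ** 2
    return n, math.ceil(n)                # exact value and next higher integer

# Figures of Example 16.4: sigma = 4.5, alpha = beta = 0.05, shift of Rs. 2.
n_exact, n_needed = sample_size(180, 182, 4.5, 0.05, 0.05)
print(round(n_exact, 2), n_needed)
```

Rounding up rather than to the nearest integer is deliberate: any smaller sample would leave one of the two risks above its specified level.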


16.1.11. Formulation of Hypotheses. To formulate the null ($H_0$) and alternative ($H_1$) hypotheses, which are statements about parameters, not statistics, is perhaps the most difficult task. The hypotheses must be formulated in such a way that when one is true, the other is false, i.e. $H_0$ and $H_1$ are opposites. The basic rule in formulating hypotheses is to make $H_1$ the hypothesis that the experimenter thinks is true or the hypothesis that he (she) wants to establish as true.

Sometimes hypotheses are formulated from an experimenter's attitude toward a claim. If the experimenter wishes to establish a certain claim with substantive support of sample information, then the claim is taken as the alternative hypothesis $H_1$ and its negation becomes the null hypothesis $H_0$. If the experimenter wants to disprove or refute the claim, then the claim is made $H_0$. If the problem simply says to test a claim, it would be interpreted to mean that the claim is to be disproved (and would be made $H_0$).


In situations where we want to test for a change in the value of a parameter, the old or accepted value of the parameter is used for $H_0$, and $H_1$ includes the new value.


The null hypothesis $H_0$ will always contain some form of an equality sign such as $=$, $\le$ or $\ge$. If $H_0$ contains the exact form of equality sign ($=$), then $H_1$ will have the not-equal sign $\ne$. If $H_0$ is stated using the less-than-or-equal-to sign $\le$, then $H_1$ will contain the greater-than sign $>$, and if $H_0$ contains the sign $\ge$, then $H_1$ will have the sign $<$. Examples are:

$H_0: \mu = 62$ (say),   $H_1: \mu \ne 62$
$H_0: \mu \le 62$,   $H_1: \mu > 62$
$H_0: \mu \ge 62$,   $H_1: \mu < 62$


It is of importance to note that by rejecting the null hypothesis $H_0: \mu = 62$ (and accepting the alternative $H_1: \mu > 62$), we are automatically rejecting all values of $\mu$ that are less than 62, because the null and the alternative hypotheses, being opposites, together account for all values of the parameter $\mu$.


In general, if $\theta_0$ is a specified value of a parameter $\theta$, then the null and alternative hypotheses, in case of a two-tailed test, take the form

$H_0: \theta = \theta_0$ and $H_1: \theta \ne \theta_0$,

and in case of a one-tailed test, are stated in either of the following two forms:

(i) $H_0: \theta \le \theta_0$ and $H_1: \theta > \theta_0$,
(ii) $H_0: \theta \ge \theta_0$ and $H_1: \theta < \theta_0$.

The sampling distribution of $\hat{\theta}$ in case of an inexact null hypothesis ($H_0: \theta \le \theta_0$ or $H_0: \theta \ge \theta_0$) is not defined and hence we cannot set up the acceptance and rejection regions. In such a case, we would take the null hypothesis as if it is an exact one, i.e. $H_0: \theta = \theta_0$.

16.1.12. General Procedure for Testing Hypotheses. The procedure for testing a hypothesis about a population parameter involves the following six steps:

(i) State your problem and formulate an appropriate null hypothesis $H_0$ with an alternative hypothesis $H_1$, which is to be accepted when $H_0$ is rejected.

(ii) Decide upon a significance level, $\alpha$, of the test, which is the probability of rejecting the null hypothesis if it is true.

(iii) Choose an appropriate test-statistic, and determine and sketch the sampling distribution of the test-statistic, assuming $H_0$ is true.

(iv) Determine the rejection or critical region in such a way that the probability of rejecting the null hypothesis $H_0$, if it is true, is equal to the significance level, $\alpha$. The location of the critical region depends upon the form of $H_1$. The significance level will separate the acceptance region from the rejection region.

(v) Compute the value of the test-statistic from the sample data in order to decide whether to accept or reject the null hypothesis $H_0$.

(vi) Formulate the decision rule as below:
(a) Reject the null hypothesis $H_0$, if the computed value of the test-statistic falls in the rejection region, and conclude that $H_1$ is true.
(b) Accept the null hypothesis $H_0$, otherwise.

When a hypothesis is rejected, we can give a measure of the strength of the rejection by giving the P-value, the smallest significance level at which the null hypothesis is being rejected.
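For a test-statistic that is standard normal under $H_0$, the P-value mentioned above and the decision rule of step (vi) can be sketched as follows. The function names `p_value` and `decide` and the labels `"two-sided"`, `"greater"` and `"less"` are assumptions made for illustration only.

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """Smallest significance level at which the observed z rejects H0."""
    phi = NormalDist().cdf
    if alternative == "two-sided":
        return 2 * (1 - phi(abs(z)))   # both tails
    if alternative == "greater":
        return 1 - phi(z)              # upper tail only
    if alternative == "less":
        return phi(z)                  # lower tail only
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

def decide(z, alpha, alternative="two-sided"):
    """Reject H0 when the P-value is below alpha; accept otherwise."""
    p = p_value(z, alternative)
    return ("reject H0" if p < alpha else "accept H0"), p

print(decide(2.14, 0.05))              # a two-sided test with z = 2.14
print(decide(1.16, 0.05, "greater"))   # an upper-tailed test with z = 1.16
```

Comparing the P-value with $\alpha$ gives the same decision as checking whether $z$ falls in the rejection region, but it also reports how strong the rejection is.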


16.2 TESTS BASED ON NORMAL DISTRIBUTION 


Suppose we wish to test a hypothesis that a parameter $\theta$ of a normal distribution has some specified value $\theta_0$. We draw a random sample of size $n$ from the population and calculate $\hat{\theta}$ as an estimate of $\theta$. It has been shown in the previous chapter that the sampling distribution of $\hat{\theta}$ is normal or approximately normal with mean $\theta$ and standard deviation $\sigma_{\hat{\theta}}$. Then the variable

$$Z = \frac{\hat{\theta} - \theta}{\sigma_{\hat{\theta}}}$$

is $N(0, 1)$. If the null hypothesis $H_0: \theta = \theta_0$ is true, then

$$Z = \frac{\hat{\theta} - \theta_0}{\sigma_{\hat{\theta}}}$$

is $N(0, 1)$ and is used as the test-statistic for testing the hypothesis $H_0: \theta = \theta_0$.

If the significance level is $\alpha$, then the critical region will consist of values of Z

(i) less than $-z_{\alpha/2}$ and greater than $z_{\alpha/2}$ in case of a two-tailed test,
(ii) less than $-z_{\alpha}$ or greater than $z_{\alpha}$ in case of a one-tailed test.

Critical values of Z for the most frequently used values of $\alpha$ are given below:

Significance level ($\alpha$)    Two-tailed test                One-tailed test
0.10                             ±1.645 = $z_{\alpha/2}$        ±1.28 = $z_{\alpha}$
0.05                             ±1.96 = $z_{\alpha/2}$         ±1.645 = $z_{\alpha}$
0.01                             ±2.58 = $z_{\alpha/2}$         ±2.33 = $z_{\alpha}$
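These critical values need not be read from tables; they are quantiles of the standard normal distribution. A minimal sketch, assuming the three commonly used levels 0.10, 0.05 and 0.01 (the dictionary name `critical` is illustrative):

```python
from statistics import NormalDist

std = NormalDist()   # standard normal, mean 0 and s.d. 1

# For each alpha: (two-tailed value z_{alpha/2}, one-tailed value z_alpha).
critical = {}
for alpha in (0.10, 0.05, 0.01):
    two_tailed = std.inv_cdf(1 - alpha / 2)
    one_tailed = std.inv_cdf(1 - alpha)
    critical[alpha] = (two_tailed, one_tailed)
    print(f"alpha={alpha:.2f}  two-tailed=+/-{two_tailed:.3f}  one-tailed={one_tailed:.3f}")
```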


The decision rule is then formulated as below:

Reject the null hypothesis $H_0$ when the computed value of the test-statistic Z exceeds these critical values. Accept the null hypothesis $H_0$, otherwise.

In this chapter, we deal with the following tests of hypotheses:

(i) To test whether the mean $\mu$ of a normal population is equal to a specified value $\mu_0$, when the population standard deviation $\sigma$ is known. Symbolically, $H_0: \mu = \mu_0$ when $\sigma$ is known.

(ii) To test whether the mean $\mu$ of a normal population is equal to a specified value $\mu_0$, when the population standard deviation is not known and the sample size is large.

(iii) To test whether the mean $\mu$ of a non-normal population is equal to a specified value $\mu_0$, when the sample size is large.

(iv) To test whether the difference between means of two normal distributions ($\mu_1 - \mu_2$) is equal to a specified value $\Delta_0$ ($\Delta_0$ may be zero), when $\sigma_1$ and $\sigma_2$ are known, i.e. $H_0: \mu_1 - \mu_2 = \Delta_0$, when $\sigma_1$ and $\sigma_2$ are known.

(v) To test whether the difference between means of two normal distributions is equal to a specified value $\Delta_0$, when $\sigma_1$ and $\sigma_2$ are not known and the sample sizes are large.

(vi) To test whether the difference between means of two non-normal distributions is equal to a specified value $\Delta_0$, when the sample sizes are large.

(vii) Other tests based on the normal distribution for large sample sizes, such as tests of standard deviations, proportions, etc.
16.2.1. Testing Hypothesis about Mean of a Normal Population when $\sigma$ is known. Suppose a random sample of size $n$ is drawn from a normal population with mean having a specified value $\mu_0$ and a known standard deviation $\sigma$. The sample mean is given by $\bar{X}$. We wish to determine whether the sample accords with the hypothesis that the population mean $\mu$ has the specified value $\mu_0$. For this purpose, we employ the normal distribution test

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

and the procedure is outlined below:

(i) Formulate the null and alternative hypotheses about $\mu$. Three possible forms are
(a) $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$
(b) $H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$
(c) $H_0: \mu \ge \mu_0$ and $H_1: \mu < \mu_0$

(ii) Decide on the significance level $\alpha$, i.e. take $\alpha = 0.05$ or 0.01.

(iii) The test-statistic in this case will be $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$. Under the null hypothesis, Z has a standard normal distribution.

(iv) Determine the rejection region, which actually depends on the alternative hypothesis, from the table of areas under the normal curve by finding areas exactly equal to $\alpha$. The rejection regions for $H_0$ corresponding to different alternative hypotheses are given below:

When the alternative hypothesis is        the rejection region will be
(a) $H_1: \mu \ne \mu_0$ (two-sided)      $Z < -z_{\alpha/2}$ and $Z > z_{\alpha/2}$
(b) $H_1: \mu > \mu_0$ (one-sided)        $Z > z_{\alpha}$
(c) $H_1: \mu < \mu_0$ (one-sided)        $Z < -z_{\alpha}$




For example, when $H_1$ is $\mu \ne \mu_0$, the rejection region in the two tails for $\alpha = 0.05$ would be $z < -1.96$ and $z > 1.96$, as the critical values of Z are $-z_{0.025} = -1.96$ and $z_{0.025} = 1.96$. In this text, where the alternative hypothesis has not been stated explicitly or implicitly, a two-sided alternative hypothesis has been assumed.

(v) Calculate the value of Z from the sample data.

(vi) Decide as below:

Reject $H_0$ when the calculated value of Z falls in the rejection region; otherwise, accept it.

In case of rejection, the decision would be that $\mu$ differs from $\mu_0$.
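The six steps above can be collected into a small routine. The following is a sketch only: the function name `one_sample_z` and the labels used for the alternative are assumptions, and the two calls use the figures of the examples that follow (n = 25, mean 83, $\mu_0$ = 80, $\sigma$ = 7; and n = 13, mean 34, $\mu_0$ = 31, variance 70, $\alpha$ = 0.10).

```python
from statistics import NormalDist

def one_sample_z(xbar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Z-test for a mean with known sigma; returns (z, reject_H0)."""
    z = (xbar - mu0) / (sigma / n ** 0.5)       # step (iii): test-statistic
    std = NormalDist()
    if alternative == "two-sided":              # H1: mu != mu0
        reject = abs(z) > std.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":              # H1: mu > mu0
        reject = z > std.inv_cdf(1 - alpha)
    elif alternative == "less":                 # H1: mu < mu0
        reject = z < -std.inv_cdf(1 - alpha)
    else:
        raise ValueError("unknown alternative")
    return z, reject

z1, reject1 = one_sample_z(83, 80, 7, 25)                        # two-sided, alpha = 0.05
z2, reject2 = one_sample_z(34, 31, 70 ** 0.5, 13, 0.10, "greater")
print(round(z1, 2), reject1)
print(round(z2, 2), reject2)
```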


Example 16.5. A random sample of $n = 25$ values gives $\bar{x} = 83$. Can this sample be regarded as drawn from a normal population with mean $\mu = 80$ and known $\sigma = 7$?

(i) We formulate our null and alternative hypotheses as
$H_0: \mu = 80$ and $H_1: \mu \ne 80$ (two-sided)

(ii) We set the significance level at $\alpha = 0.05$.

(iii) The test-statistic to be used is $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, which under the null hypothesis is a standard normal variable.

(iv) The critical region for $\alpha = 0.05$ is $|Z| > 1.96$. The hypothesis will be rejected if, for the sample, $|Z| > 1.96$.

(v) We calculate the value of Z from the sample data as

$$z = \frac{83 - 80}{7/\sqrt{25}} = \frac{3 \times 5}{7} = 2.14.$$

(vi) Conclusion. Since our calculated value $z = 2.14$ falls in the critical region, we reject our null hypothesis $H_0: \mu = 80$ and accept $H_1: \mu \ne 80$. We may conclude that the sample with $\bar{x} = 83$ cannot be regarded as drawn from the population with $\mu = 80$.

Example 16.6. Test the hypothesis that the mean of a normal population with known variance 70 is $\mu = 31$, if a sample of size 13 gave $\bar{x} = 34$. Let the alternative hypothesis be $H_1: \mu > 31$, and let $\alpha = 0.10$.

(i) We formulate our hypotheses as
$H_0: \mu = 31$, and $H_1: \mu > 31$ (one-sided)

(ii) We are given the significance level as $\alpha = 0.10$.

(iii) The test-statistic to be used is $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, which under the null hypothesis has a standard normal distribution.

(iv) The critical region for $\alpha = 0.10$ is $Z > 1.28$.

(v) We calculate the value of Z from the sample data as

$$z = \frac{34 - 31}{\sqrt{70}/\sqrt{13}} = \frac{3 \times 3.606}{8.367} = 1.29.$$

(vi) Conclusion. Since our calculated value $z = 1.29$ falls in the critical region, we reject our null hypothesis $H_0: \mu = 31$. We may conclude that there is evidence at the 10% level that the sample does not come from the given population.
16.2.2. Testing Hypothesis about Mean of a Normal Population when $\sigma$ is unknown and $n > 30$. When the population standard deviation $\sigma$ is not known, we use the sample standard deviation $S$ as an estimate of the true but unknown population standard deviation. For large sample size ($n > 30$), the central limit theorem allows us to assume that the sampling distribution of $\bar{X}$ is approximately normal with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$. In other words, when $\sigma$ is unknown but $n$ is large, we replace $\sigma/\sqrt{n}$ with $S/\sqrt{n}$, and the variable

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}},$$

which is approximately $N(0, 1)$, is then used as the test-statistic to test the hypothesis $H_0: \mu = \mu_0$. The rest of the procedure is the same.
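The only change from the known-$\sigma$ case is that $S$ takes the place of $\sigma$ in the standard error. A brief sketch, with a hypothetical function name `large_sample_z` and with illustrative figures (n = 36, sample mean 27, S = 5, $\mu_0$ = 25):

```python
def large_sample_z(xbar, s, n, mu0):
    """Approximate Z for a mean when sigma is unknown and n is large (n > 30)."""
    if n <= 30:
        raise ValueError("the normal approximation assumes a large sample (n > 30)")
    return (xbar - mu0) / (s / n ** 0.5)   # S/sqrt(n) replaces sigma/sqrt(n)

z = large_sample_z(xbar=27, s=5, n=36, mu0=25)
print(round(z, 2))   # compared with 1.96 for a two-sided test at alpha = 0.05
```

The guard on $n$ simply encodes the large-sample condition stated above; for small samples this statistic is not approximately $N(0, 1)$.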

Example 16.7. The marks obtained by students at a large number of colleges are known to be normally distributed with a mean of 25. A random sample of 36 students showed an average number of marks of 27 with a standard deviation of 5. What conclusion should be drawn?

(i) Since it is given that the average number of marks obtained by students at a large number of colleges is 25, we therefore have
$H_0: \mu = 25$ and $H_1: \mu \ne 25$ (two-tailed)

(ii) Let us specify the significance level at $\alpha = 0.05$.


(iii) Because $\sigma$ is not known and the sample size $n > 30$, we use $S$ in place of $\sigma$. Thus the test-statistic is

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}},$$

which has an approximate standard normal distribution under the given null hypothesis.

(iv) The rejection region is $|Z| \ge 1.96$.

(v) Computing the value of Z, we find

$$z = \frac{27 - 25}{5/\sqrt{36}} = \frac{2 \times 6}{5} = 2.40.$$

(vi) Conclusion. Since the calculated value $z = 2.40$ falls in the rejection region, we therefore reject $H_0: \mu = 25$ and accept $H_1: \mu \ne 25$. On the basis of sample evidence, we may conclude that this sample of students appears to differ from the population with mean 25.

16.2.3. Testing Hypothesis about Mean of a Non-normal Population when sample size is large. The central limit theorem tells us that for large sample sizes, the sampling distribution of $\bar{X}$ is approximately normal even though the population sampled is non-normal. That is, the random variable

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \quad \text{or} \quad Z = \frac{\bar{X} - \mu}{S/\sqrt{n}},$$

according as $\sigma$ is known or not known, is approximately standard normal and is used as the test-statistic to test the hypothesis $H_0: \mu = \mu_0$. The rest of the procedure is the same.

Example 16.8. A random sample of 100 workers with children in day care shows a mean day-care cost of Rs. 2,600 and a standard deviation of Rs. 500. Verify the department's claim that the mean exceeds Rs. 2,500 at the 0.05 level with this information.

We make $H_1$ what the department claims, that the mean exceeds Rs. 2,500, and take the negation of its claim as $H_0$. Thus, we have

(i) $H_0: \mu \le 2{,}500$
$H_1: \mu > 2{,}500$ (exceeds 2,500)

(ii) We are given the significance level at $\alpha = 0.05$.

(iii) The test-statistic, under $H_0$, is

$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}},$$

which is approximately normal as $n = 100$ is large enough to make use of the central limit theorem.

(iv) The rejection region is $Z > z_{0.05} = 1.645$.

(v) Computing the value of Z from sample information, we find

$$z = \frac{2600 - 2500}{500/\sqrt{100}} = \frac{100}{50} = 2.$$

(vi) Conclusion. Since the calculated value $z = 2$ falls in the rejection region, we therefore reject $H_0$, and may conclude that the sample evidence supports the department's claim.

Example 16.9. A random sample of 100 observations from a population known to be non-normal yielded the sample values $\bar{x} = 182$ and $S^2 = 299$. Test the hypothesis $H_0: \mu \le 180$ against $H_1: \mu > 180$. Let $\alpha = 0.05$.

(i) We are given our hypotheses as
$H_0: \mu \le 180$, and
$H_1: \mu > 180$ (one-sided)

(ii) The significance level is $\alpha = 0.05$.

(iii) Since the sample size ($n = 100$) is large enough to allow us to assume that the sampling distribution of $\bar{X}$ is approximately normal with mean $= \mu$ and standard deviation $= S/\sqrt{n}$, we therefore use the variable $Z = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$ as the test-statistic under the given hypothesis.

(iv) The critical region is $Z > z_{0.05} = 1.645$.

(v) Here $\mu_0 = 180$, $\bar{x} = 182$, $n = 100$ and $S = \sqrt{299} = 17.29$, so that

$$z = \frac{182 - 180}{17.29/\sqrt{100}} = \frac{20}{17.29} = 1.16.$$

(vi) Conclusion. Since our calculated value $z = 1.16$ falls in the acceptance region, we accept $H_0: \mu \le 180$ and reject $H_1: \mu > 180$.

16.2.4. Testing Hypotheses about Difference between Two Population Means. To test hypotheses about the difference between two population means, we deal with the following three cases:

(1) Both the populations are normal with known standard deviations.
(2) Both the populations are normal with unknown standard deviations.
(3) Both the populations are non-normal, in which case both sample sizes are necessarily large.


Case 1. Let $\bar{X}_1$ be the mean of the first random sample of size $n_1$ from a normal population with a mean of $\mu_1$ and a known standard deviation $\sigma_1$, and $\bar{X}_2$ be the mean of the second random sample of size $n_2$ from another normal population with a mean of $\mu_2$ and a known standard deviation $\sigma_2$. Then the sampling distribution of the difference $\bar{X}_1 - \bar{X}_2$ is normally distributed with a mean of $\mu_1 - \mu_2$ and a standard deviation of $\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$. In other words, the variable

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

is exactly standard normal, no matter how small the sample sizes are. Hence it is used as the test-statistic for testing hypotheses about the difference between two population means. The difference may be equal to, less than, or greater than a specified value $\Delta_0$ ($\Delta_0$ may equal zero). Then the hypotheses are

(i) $H_0: \mu_1 - \mu_2 = \Delta_0$, and $H_1: \mu_1 - \mu_2 \ne \Delta_0$,
(ii) $H_0: \mu_1 - \mu_2 \le \Delta_0$, and $H_1: \mu_1 - \mu_2 > \Delta_0$,
(iii) $H_0: \mu_1 - \mu_2 \ge \Delta_0$, and $H_1: \mu_1 - \mu_2 < \Delta_0$.

When the null hypothesis $H_0: \mu_1 - \mu_2 = \Delta_0$ is true, the test-statistic becomes

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$

which is exactly standard normal.
The procedure for testing the hypothesis $H_0: \mu_1 - \mu_2 = \Delta_0$ is stated as below:

(i) Formulate the null and the alternative hypotheses: $H_0: \mu_1 - \mu_2 = \Delta_0$ against the appropriate alternative.

(ii) Decide on the significance level $\alpha$.

(iii) The test-statistic Z, which under $H_0$ becomes

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$

is exactly standard normal.

(iv) The rejection region is
$Z < -z_{\alpha/2}$ and $Z > z_{\alpha/2}$, when $H_1$ is $\mu_1 - \mu_2 \ne \Delta_0$,
$Z > z_{\alpha}$, when $H_1$ is $\mu_1 - \mu_2 > \Delta_0$,
$Z < -z_{\alpha}$, when $H_1$ is $\mu_1 - \mu_2 < \Delta_0$.

(v) Compute the value of Z from the sample data.

(vi) Decide as below:
Reject $H_0$ if $z$ falls in the critical region; accept $H_0$, otherwise.

Example 16.10. A random sample of size 36 from a normal population with variance 24 gave $\bar{X}_1 = 15$. A second sample of size 28 from another normal population with variance 80 gave $\bar{X}_2 = 13$. Test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \ne 0$. Let $\alpha = 0.05$.

(i) The null and the alternative hypotheses are
$H_0: \mu_1 - \mu_2 = 0$, and
$H_1: \mu_1 - \mu_2 \ne 0$ (two-sided)

(ii) The significance level is $\alpha = 0.05$.

(iii) The test-statistic under $H_0$ is

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$

which is exactly standard normal.

(iv) The rejection region is $|Z| \ge 1.96$.

(v) We compute the value of Z from the sample data as

$$z = \frac{15 - 13}{\sqrt{\dfrac{24}{36} + \dfrac{80}{28}}} = \frac{2}{\sqrt{3.5238}} = \frac{2}{1.88} = 1.06.$$

(vi) Conclusion. Since the calculated value $z = 1.06$ does not fall in the rejection region, we do not reject $H_0: \mu_1 - \mu_2 = 0$.
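The test-statistic for the difference between two means with known variances can be computed as sketched below. The function name `two_sample_z` and its parameter names are hypothetical; the call uses the figures of Example 16.10 with $\Delta_0 = 0$.

```python
from statistics import NormalDist

def two_sample_z(xbar1, xbar2, var1, var2, n1, n2, delta0=0.0):
    """Z for H0: mu1 - mu2 = delta0 when both population variances are known."""
    se = (var1 / n1 + var2 / n2) ** 0.5      # s.d. of the difference of means
    return ((xbar1 - xbar2) - delta0) / se

z = two_sample_z(15, 13, 24, 80, 36, 28)     # figures of Example 16.10
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-sided test at alpha = 0.05
reject = abs(z) > z_crit
print(round(z, 2), reject)
```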
Example 16.11. The two samples A and B detailed below were taken from normal populations of standard deviation 0.8. Test whether the difference of means is significant.

A: 10.5, 11.6, 12.7, 12.9, 13.5, 13.6, 14.8
B: 11.3, 12.4, 12.4, 13.9, 14.2, 14.7, 14.9, 15.6
(P.U., B.A./B.Sc. 1983)

(i) We formulate our null and alternative hypotheses as
$H_0: \mu_1 - \mu_2 = 0$ and $H_1: \mu_1 - \mu_2 \ne 0$ (two-sided)

(ii) We set the significance level at $\alpha = 0.05$.

(iii) The test-statistic to be used under $H_0$ is

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \quad (\because \sigma_1 = \sigma_2 = \sigma),$$

which is exactly standard normal.

(iv) The critical region is $|Z| \ge 1.96$.

(v) Computations: Here $n_1 = 7$, $n_2 = 8$, $\sigma = 0.8$,

$$\bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{89.6}{7} = 12.8, \quad \text{and} \quad \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{109.4}{8} = 13.675,$$

$$z = \frac{12.8 - 13.675}{0.8\sqrt{\dfrac{1}{7} + \dfrac{1}{8}}} = \frac{-0.875}{(0.8)(0.5176)} = \frac{-0.875}{0.414} = -2.11.$$

(vi) Conclusion. Since the calculated value $z = -2.11$ falls in the rejection region, we reject $H_0: \mu_1 - \mu_2 = 0$. On the basis of the evidence, we may conclude that the difference between means is significant.
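When raw observations are given, as in Example 16.11, the sample means can be computed directly before forming Z. The sketch below assumes the seven A-values and eight B-values as read from the example (they sum to 89.6 and 109.4, matching the worked means); the variable names are illustrative.

```python
from statistics import mean

# Observations as read from Example 16.11 (assumed transcription).
sample_a = [10.5, 11.6, 12.7, 12.9, 13.5, 13.6, 14.8]
sample_b = [11.3, 12.4, 12.4, 13.9, 14.2, 14.7, 14.9, 15.6]
sigma = 0.8                      # common known population s.d.

n1, n2 = len(sample_a), len(sample_b)
xbar1, xbar2 = mean(sample_a), mean(sample_b)

# With sigma1 = sigma2 = sigma the standard error simplifies as in step (iii).
z = (xbar1 - xbar2) / (sigma * (1 / n1 + 1 / n2) ** 0.5)
reject = abs(z) > 1.96           # two-sided test at alpha = 0.05
print(round(xbar1, 3), round(xbar2, 3), round(z, 2), reject)
```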


Case 2. When independent random samples of sizes $n_1$ and $n_2$ are drawn from normal populations with means $\mu_1$ and $\mu_2$ but unknown standard deviations, the sample standard deviations $S_1$ and $S_2$ can be substituted for the population standard deviations $\sigma_1$ and $\sigma_2$. If the sample sizes are large ($n_1, n_2 > 30$), we can assume that the sampling distribution of $\bar{X}_1 - \bar{X}_2$ is approximately normal with a mean $\mu_1 - \mu_2$ and a standard deviation $\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}$. That is, the variable

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

is approximately $N(0, 1)$. The test-statistic to be used under the hypothesis $H_0: \mu_1 - \mu_2 = \Delta_0$ becomes

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}},$$

which is approximately standard normal. The rest of the procedure is the same.


Example 16.12. A form of intelligence test was given to random samples of soldiers and sailors in a certain country. The following results were recorded:

Samples      Sample Size    Mean Score    Sample Standard Deviation
Soldiers     332            12.99         2.43
Sailors      615            12.78         2.48

Assume the populations of scores to be normal. What conclusion should be drawn?

(i) We must decide between the hypotheses
$H_0: \mu_1 - \mu_2 = 0$, i.e. there is no difference between the mean scores,
$H_1: \mu_1 - \mu_2 \ne 0$, i.e. there is a significant difference between means.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) Since the population standard deviations are unknown and the sample sizes are large, we therefore substitute the sample standard deviations for the population standard deviations. Then the test-statistic to be used under $H_0$ is

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}},$$

which is approximately standard normal.

(iv) The critical region is $|Z| \ge 1.96$.

(v) We compute the value of Z as below:

$$z = \frac{12.99 - 12.78}{\sqrt{\dfrac{(2.43)^2}{332} + \dfrac{(2.48)^2}{615}}} = \frac{0.21}{\sqrt{0.0178 + 0.0100}} = \frac{0.21}{0.17} = 1.24.$$

(vi) Conclusion. Since the calculated value $z = 1.24$ does not fall in the rejection region, we accept the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ at the 5% significance level. In other words, on the basis of the evidence, we may conclude that the difference between mean scores is insignificant or merely due to chance.


Case 3. The populations are non-normal and the sample sizes are sufficiently large. It is interesting to note that the central limit theorem also applies for the sampling distribution of the difference between two sample means. Thus, if sufficiently large samples are drawn from the non-normal populations, the sampling distribution of the difference $\bar{X}_1 - \bar{X}_2$ will be approximately normal with a mean of $\mu_1 - \mu_2$ and a standard deviation

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \quad \text{or} \quad \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}},$$

according as the population standard deviations are known or unknown. In other words, if the sample sizes are sufficiently large, the variable

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\text{S.E.}(\bar{X}_1 - \bar{X}_2)}$$

is approximately standard normal, regardless of the form of the population distributions. The test-statistic under the hypothesis $H_0: \mu_1 - \mu_2 = \Delta_0$ then becomes

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{\text{S.E.}(\bar{X}_1 - \bar{X}_2)},$$

which is approximately $N(0, 1)$. The rest of the procedure for testing the null hypothesis $H_0: \mu_1 - \mu_2 = \Delta_0$ is the same.
Example 16.13. A random sample of size 40 from a non-normal population yielded the sample values $\bar{x}_1 = 70.4$, $S_1^2 = 31.40$. Another random sample of size 50 from a second non-normal population yielded the sample values $\bar{x}_2 = 65.3$, $S_2^2 = 44.82$. Test $H_0: \mu_1 - \mu_2 \le 2$ against $H_1: \mu_1 - \mu_2 > 2$. Let $\alpha = 0.05$.

(i) The null and the alternative hypotheses are
$H_0: \mu_1 - \mu_2 \le 2$, and $H_1: \mu_1 - \mu_2 > 2$ (one-sided)

(ii) The significance level is chosen at $\alpha = 0.05$.

(iii) Since the populations are non-normal and the sample sizes are large, the test-statistic under $H_0$ is

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - 2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}},$$

which is approximately standard normal.

(iv) The critical region is $Z > 1.645$, as for $\alpha = 0.05$, $z_{0.05} = 1.645$.

(v) Computations: The sample values are given to be
$n_1 = 40$, $\bar{x}_1 = 70.4$, $S_1^2 = 31.40$,
$n_2 = 50$, $\bar{x}_2 = 65.3$, $S_2^2 = 44.82$.

$$z = \frac{(70.4 - 65.3) - 2}{\sqrt{\dfrac{31.40}{40} + \dfrac{44.82}{50}}} = \frac{3.1}{\sqrt{0.785 + 0.8964}} = \frac{3.1}{1.30} = 2.38.$$

(vi) Conclusion. We see that $z = 2.38$ falls in the critical region. Therefore, we reject our null hypothesis $H_0: \mu_1 - \mu_2 \le 2$ and accept $H_1: \mu_1 - \mu_2 > 2$.
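For large samples with unknown standard deviations, the same computation applies with the sample variances in place of the population variances and a non-zero $\Delta_0$ if required. The following sketch uses the figures of Example 16.13; the function name `large_two_sample_z` is hypothetical.

```python
from statistics import NormalDist

def large_two_sample_z(xbar1, xbar2, s1_sq, s2_sq, n1, n2, delta0=0.0):
    """Approximate Z for H0: mu1 - mu2 = delta0 using sample variances."""
    se = (s1_sq / n1 + s2_sq / n2) ** 0.5    # estimated S.E. of the difference
    return ((xbar1 - xbar2) - delta0) / se

# Figures of Example 16.13, testing against H1: mu1 - mu2 > 2.
z = large_two_sample_z(70.4, 65.3, 31.40, 44.82, 40, 50, delta0=2)
reject = z > NormalDist().inv_cdf(1 - 0.05)  # upper-tailed, z_0.05 about 1.645
print(round(z, 2), reject)
```

Carrying full precision gives a value of about 2.39; a hand calculation that rounds the standard error to 1.30 gives a slightly smaller figure, and the decision is the same either way.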

16.2.5. Testing Hypothesis about a Population Proportion when sample size is large. Let $\hat{P}$ be the proportion of successes in a random sample of size n drawn from a binomial population having proportion p. If the sample size is sufficiently large, then $\hat{P}$ will be approximately normally distributed with a mean p and a standard deviation $\sqrt{pq/n}$, where $q = 1 - p$. In other words, if the sample is large, then the variable

$$Z = \frac{\hat{P} - p}{\sqrt{pq/n}}$$

is approximately standard normal.

When $\hat{P} = \dfrac{X}{n}$, where X is the actual number of successes in a random sample, the standardised variable becomes

$$Z = \frac{\dfrac{X}{n} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{X - np}{\sqrt{npq}}.$$

Suppose we wish to test the null hypothesis that the population proportion p has a specified value $p_0$. If the hypothesis $H_0: p = p_0$ is true, then the variable

$$Z = \frac{\hat{P} - p_0}{\sqrt{p_0 q_0 / n}} = \frac{X - np_0}{\sqrt{np_0 q_0}}$$

is approximately standard normal, and is used as a test-statistic for testing $H_0: p = p_0$. The decision procedure is stated below:
(i) Formulate the null hypothesis and the appropriate alternative hypothesis.

(ii) Choose the significance level of size $\alpha$.

(iii) The test-statistic, if the null hypothesis $H_0: p = p_0$ is true, will be

$$Z = \frac{X - np_0}{\sqrt{np_0 q_0}} \quad \text{(without continuity correction)}$$

$$Z = \frac{(X \pm \tfrac{1}{2}) - np_0}{\sqrt{np_0 q_0}} \quad \text{(with continuity correction)}$$

where $q_0 = 1 - p_0$, and $X - \tfrac{1}{2}$ is used when $X > np_0$ and $X + \tfrac{1}{2}$ when $X < np_0$. For large n, Z is approximately standard normal.

(iv) The critical regions are established as under:
(a) When $H_1$ is $p \ne p_0$, the critical region is $Z < -z_{\alpha/2}$ and $Z > z_{\alpha/2}$.
(b) When $H_1$ is $p > p_0$, the critical region is $Z > z_{\alpha}$.
(c) When $H_1$ is $p < p_0$, the critical region is $Z < -z_{\alpha}$.

(v) Compute the value of Z from the sample data.

(vi) Decide as below:
Reject $H_0$ if z falls in the critical region.
Accept $H_0$ otherwise.
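The computation in step (v) can be sketched in a few lines of Python. This is an illustration only: the function name `proportion_z` and its `continuity` flag are assumptions made for this sketch, and the continuity correction is implemented as moving the observed count half a unit towards $np_0$.

```python
import math

def proportion_z(x, n, p0, continuity=True):
    """z statistic for H0: p = p0, given x successes in n trials."""
    diff = x - n * p0
    if continuity:
        # Move x half a unit towards n*p0 (x - 1/2 if x > n*p0, x + 1/2 if x < n*p0).
        diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
    return diff / math.sqrt(n * p0 * (1.0 - p0))

# 216 heads in 400 tosses of a coin, H0: p = 0.5 (two-sided test).
z_coin = proportion_z(216, 400, 0.5)

# 38 correct identifications out of 90, H0: p = 1/3 (one-sided test).
z_plain = proportion_z(38, 90, 1 / 3, continuity=False)
z_corrected = proportion_z(38, 90, 1 / 3, continuity=True)

print(round(z_coin, 2), round(z_plain, 2), round(z_corrected, 2))
```

The value obtained is then compared with the critical value from step (iv), for instance 1.96 for a two-sided test or 1.645 for a one-sided test at the 0.05 level.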


Example 16.14. A coin is tossed 400 times and it turns up heads 216 times. Discuss whether the coin may be an unbiased one.

(i) We consider the null hypothesis that the coin is unbiased, i.e. $H_0: p = \tfrac{1}{2}$, against the alternative hypothesis $H_1: p \ne 0.5$.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) The test-statistic to be used under $H_0$ is

$$Z = \frac{(X \pm \tfrac{1}{2}) - np_0}{\sqrt{np_0 q_0}} \quad \text{(with continuity correction)}$$

which is approximately standard normal.

(iv) The critical region is $|Z| > 1.96$.

(v) We compute the value of Z as below:

$$z = \frac{(216 - \tfrac{1}{2}) - 400 \times \tfrac{1}{2}}{\sqrt{400 \times \tfrac{1}{2} \times \tfrac{1}{2}}} = \frac{15.5}{10} = 1.55.$$

We have used $x - \tfrac{1}{2}$ as $x > np_0$.

(vi) Conclusion. Since the computed value $z = 1.55$ is less than 1.96, we accept the hypothesis, and may conclude that the coin is unbiased.

Example 16.15. In an experiment to decide whether butter and margarine can be distinguished, 90 individuals were each given three samples, two containing butter and the other one containing margarine. They were then asked which one of the three contains margarine, and 38 correct identifications were made. Test the significance of the test. (P.U., M.Sc. 1971)

(i) We formulate our hypotheses as
$H_0: p = \tfrac{1}{3}$, and the individuals have no power of identification, i.e., the results are due to chance.
$H_1: p > \tfrac{1}{3}$.

(ii) We choose the level of significance at $\alpha = 0.05$.

(iii) The test-statistic to be used under $H_0$ is

$$Z = \frac{X - np_0}{\sqrt{np_0 q_0}} \quad \text{(without continuity correction)}$$

$$Z = \frac{(X \pm \tfrac{1}{2}) - np_0}{\sqrt{np_0 q_0}} \quad \text{(with continuity correction)}$$

which is approximately standard normal.

(iv) The critical region is $Z > 1.645$ because the alternative hypothesis is stated on a greater-than basis.

(v) We then compute the value of Z as below:

$$z = \frac{38 - 90 \times \tfrac{1}{3}}{\sqrt{90 \times \tfrac{1}{3} \times \tfrac{2}{3}}} = \frac{8}{4.47} = 1.79 \quad \text{(without continuity correction)}$$

$$z = \frac{(38 - \tfrac{1}{2}) - 90 \times \tfrac{1}{3}}{\sqrt{90 \times \tfrac{1}{3} \times \tfrac{2}{3}}} = \frac{7.5}{4.47} = 1.68 \quad \text{(with continuity correction)}$$

(vi) Conclusion. Since the calculated value of Z falls in the critical region in either case, so we reject $H_0: p = \tfrac{1}{3}$ and may conclude that the results are significantly different from chance identification.

16.2.6. Testing Hypothesis about Difference between Two Proportions. Suppose we wish to test the hypothesis that the difference between two proportions is equal to a specified value $\Delta_0$ or that the two proportions are equal. The statistic on which we base our decision rule is the variable $\hat{P}_1 - \hat{P}_2$, where $\hat{P}_1$ is the proportion of success in the first sample of size $n_1$ and $\hat{P}_2$ is the proportion of success in the second sample of size $n_2$; the samples are drawn from two binomial populations with unknown proportions of success $p_1$ and $p_2$ respectively. If the samples are sufficiently large, the sampling distribution of the difference $\hat{P}_1 - \hat{P}_2$ is approximately normal with a mean of $p_1 - p_2$ and a standard deviation of $\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$. That is, for sufficiently large sample sizes, the variable

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}}$$

is approximately standard normal.

The unknown population proportions $p_1$ and $p_2$ are estimated by the sample proportions $\hat{p}_1$ and $\hat{p}_2$ respectively. The standardized variable then becomes

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}}$$

and the test-statistic, if the hypothesis $H_0: p_1 - p_2 = \Delta_0$ is true, will be

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - \Delta_0}{\sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}}.$$

If we wish to test the hypothesis $H_0: p_1 = p_2 = p$, then p (the common population proportion) is replaced by its estimate $\hat{p}_c$, which is obtained by taking a weighted mean of the two observed sample proportions $\hat{p}_1$ and $\hat{p}_2$ as follows:

$$\hat{p}_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}.$$

The z value for testing the hypothesis $H_0: p_1 = p_2$ then becomes

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_c \hat{q}_c \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \quad \hat{q}_c = 1 - \hat{p}_c.$$
The procedure for testing the hypothesis $H_0: p_1 = p_2$ is given below:

(i) Formulate the null and the appropriate alternative hypotheses.

(ii) Decide upon the significance level of size $\alpha$.

(iii) The test-statistic under $H_0$ is

$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{p}_c \hat{q}_c \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

which, for large sample sizes, is approximately standard normal.

(iv) The critical regions are established as under:
(a) When $H_1$ is $p_1 - p_2 \ne 0$, the critical region is $Z < -z_{\alpha/2}$ and $Z > z_{\alpha/2}$.
(b) When $H_1$ is $p_1 - p_2 > 0$, the critical region is $Z > z_{\alpha}$.
(c) When $H_1$ is $p_1 - p_2 < 0$, the critical region is $Z < -z_{\alpha}$.

(v) Compute the value of Z from the sample data.

(vi) Decide as below:
Reject $H_0$ if z falls in the critical region.
Accept $H_0$, otherwise.

Example 16.16. In a random sample of 500 men from Lahore city, 300 are found to be smokers. In one of 1000 men from Karachi city, 550 are smokers. Do the data indicate that the two cities are significantly different with respect to the prevalence of smoking among men?

(i) We formulate our hypotheses as
$H_0: p_1 = p_2$, i.e. there is no difference between the proportions of smokers;
$H_1: p_1 \ne p_2$.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) The test-statistic under $H_0$ is

$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{p}_c \hat{q}_c \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $\hat{p}_1$ = proportion of smokers in the city of Lahore,
$\hat{p}_2$ = proportion of smokers in the city of Karachi,
$\hat{p}_c$ = an estimate of the common population proportion on the assumption that the two cities are alike with respect to the prevalence of smoking among men, i.e.

$$\hat{p}_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} \quad \text{and} \quad \hat{q}_c = 1 - \hat{p}_c.$$

The statistic Z, for large sample sizes, is approximately standard normal.

(iv) The critical region is $|Z| > 1.96$.

(v) Computations: Here $\hat{p}_1 = \dfrac{300}{500} = 0.60$, $\hat{p}_2 = \dfrac{550}{1000} = 0.55$, and

$$\hat{p}_c = \frac{300 + 550}{500 + 1000} = \frac{850}{1500} = 0.567, \quad \text{so that } \hat{q}_c = 0.433.$$

$$z = \frac{0.60 - 0.55}{\sqrt{(0.567)(0.433)\left(\dfrac{1}{500} + \dfrac{1}{1000}\right)}} = \frac{0.05}{\sqrt{(0.2455)(0.003)}} = \frac{0.05}{0.027} = 1.85.$$

(vi) Conclusion. Since the calculated value $z = 1.85$ does not fall in the critical region, so we accept the null hypothesis $H_0: p_1 = p_2$ and conclude that the data do not appear to indicate that the two cities are different with respect to the prevalence of smoking among men.
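The pooled test of equal proportions lends itself to a small numerical check. The sketch below assumes counts of successes rather than proportions as input, and the function name `pooled_two_proportion_z` is an illustrative choice. Carrying full precision gives a z of about 1.84; the 1.85 of the worked example comes from rounding the standard error to 0.027.

```python
import math

def pooled_two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled estimate of the common p."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # weighted mean of the two sample proportions
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1_hat - p2_hat) / standard_error

# 300 smokers among 500 men in one city, 550 among 1000 in the other.
z = pooled_two_proportion_z(300, 500, 550, 1000)
reject = abs(z) > 1.96                      # two-sided critical region at alpha = 0.05

print(round(z, 2), reject)
```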


Example 16.17. A candidate for mayor in a large city believes that he appeals to more than 10 per cent more of the women voters than of the men voters. In a sample, 62 of 100 women voters and 69 of 150 men voters favour him. Test the candidate's belief at the 0.05 level of significance.

Let $p_1$ = proportion of women voters, and
$p_2$ = proportion of men voters.

Then we make $H_1$ the hypothesis what the candidate for mayor believes as true and its negation as $H_0$. Thus

(i) The null and alternative hypotheses are
$H_0: p_1 - p_2 \le 0.10$ (negation of candidate's claim), and
$H_1: p_1 - p_2 > 0.10$ (candidate believes as true).

(ii) We are given the significance level at $\alpha = 0.05$.

(iii) The test-statistic under $H_0$ is

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - 0.10}{\sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}}$$

which, for large sample sizes, is approximately standard normal.

(iv) The critical region is $Z > z_{0.05} = 1.645$.

(v) Computations: Here $\hat{p}_1 = \dfrac{62}{100} = 0.62$, so that $\hat{q}_1 = 0.38$,
$\hat{p}_2 = \dfrac{69}{150} = 0.46$, so that $\hat{q}_2 = 0.54$.

$$\text{Thus } z = \frac{(0.62 - 0.46) - 0.10}{\sqrt{\dfrac{(0.62)(0.38)}{100} + \dfrac{(0.46)(0.54)}{150}}} = \frac{0.06}{\sqrt{0.002356 + 0.001656}} = \frac{0.06}{0.063} = 0.95.$$

(vi) Conclusion. Since the calculated value $z = 0.95$ does not fall in the critical region, so we accept the null hypothesis $H_0: p_1 - p_2 \le 0.10$.
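When the hypothesised difference is not zero, the two proportions are not pooled and each sample contributes its own variance term. A minimal sketch of that computation follows; the function name `two_proportion_z` and its `delta0` parameter are assumptions made for the illustration.

```python
import math

def two_proportion_z(x1, n1, x2, n2, delta0):
    """z statistic for H0: p1 - p2 = delta0, with separate (unpooled) variance terms."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    standard_error = math.sqrt(p1_hat * (1.0 - p1_hat) / n1 +
                               p2_hat * (1.0 - p2_hat) / n2)
    return ((p1_hat - p2_hat) - delta0) / standard_error

# 62 of 100 in the first sample, 69 of 150 in the second, tested difference 0.10.
z = two_proportion_z(62, 100, 69, 150, delta0=0.10)
reject = z > 1.645                          # one-sided critical region at alpha = 0.05

print(round(z, 2), reject)
```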


16.2.7. Testing Hypotheses about Standard Deviation: Large Samples. Suppose that we wish to test the null hypothesis that the standard deviation $\sigma$ of a normal population has a specified value $\sigma_0$. For a sufficiently large sample of size n, drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$, it is found that the standard error of the sample standard deviation S is $\dfrac{\sigma}{\sqrt{2n}}$. The hypothesis $H_0: \sigma = \sigma_0$ can then be tested by computing the following test-statistic

$$Z = \frac{S - \sigma_0}{\sigma_0 / \sqrt{2n}}$$

which, if $H_0: \sigma = \sigma_0$ is true, is approximately N(0, 1).

In case of two large samples of sizes $n_1$ and $n_2$, each drawn independently from a normal population, the standard error of the difference $S_1 - S_2$ is

$$\sqrt{\frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2}}.$$

It is found that for large sample sizes, the sampling distribution of the difference $S_1 - S_2$ is approximately normal with a mean of $\sigma_1 - \sigma_2$ and a standard deviation of $\sqrt{\dfrac{\sigma_1^2}{2n_1} + \dfrac{\sigma_2^2}{2n_2}}$. Thus the variable

$$Z = \frac{(S_1 - S_2) - (\sigma_1 - \sigma_2)}{\sqrt{\dfrac{\sigma_1^2}{2n_1} + \dfrac{\sigma_2^2}{2n_2}}}$$

is approximately standard normal. The test-statistic, if $H_0: \sigma_1 = \sigma_2$ is true, would be

$$Z = \frac{S_1 - S_2}{\sqrt{\dfrac{\sigma_1^2}{2n_1} + \dfrac{\sigma_2^2}{2n_2}}}$$

which is approximately N(0, 1).

The sample standard deviations can be substituted for population standard deviations when they are not known.

The credibility of the null hypothesis $H_0: \sigma_1 = \sigma_2$ may be tested in the usual manner. It should be noted that this procedure is not used when the populations are not normal.


Example 16.18. A random sample of size 100 from a normal population showed a standard deviation of 8.9. Test at the 0.05 level of significance the hypothesis that $\sigma = 7.5$ against the alternative that $\sigma \ne 7.5$.

(i) The null and alternative hypotheses are
$H_0: \sigma = 7.5$ and $H_1: \sigma \ne 7.5$ (two-tailed).

(ii) We are given the significance level at $\alpha = 0.05$.

(iii) Since the sample size (n = 100) is sufficiently large and the sample is drawn from a normal population, so the test statistic under $H_0$ would be

$$Z = \frac{S - \sigma_0}{\sigma_0 / \sqrt{2n}}$$

which is approximately N(0, 1).

(iv) The critical region is $|Z| \ge 1.96$.

(v) Computations: We are given $S = 8.9$, $\sigma_0 = 7.5$ and $n = 100$.

$$z = \frac{8.9 - 7.5}{7.5 / \sqrt{200}} = \frac{(1.4)(14.14)}{7.5} = 2.64.$$

(vi) Conclusion. Since the calculated value $z = 2.64$ falls in the critical region, so we reject the null hypothesis $H_0: \sigma = 7.5$ in favour of the alternative $H_1: \sigma \ne 7.5$.

Example 16.19. The following table gives means and standard deviations of marks obtained by the candidates in an examination held at centres A and B. Assume the marks follow a normal distribution.

Centre    No. of candidates    Mean    Standard Deviation
A         1275                 44.8    8.3
B         2346                 47.3    6.5

Is the difference between the standard deviations significant?
(P.U., M.A. Econ., 1970)




(i) We set up our hypotheses as
$H_0: \sigma_1 = \sigma_2$, i.e. there is no difference between the standard deviations, and
$H_1: \sigma_1 \ne \sigma_2$.

(ii) We specify the significance level at $\alpha = 0.05$.

(iii) The test-statistic, if $H_0: \sigma_1 = \sigma_2$ is true, is

$$Z = \frac{S_1 - S_2}{\sqrt{\dfrac{S_1^2}{2n_1} + \dfrac{S_2^2}{2n_2}}},$$

where the sample standard deviations are substituted for population standard deviations as they are not known. The variable Z is approximately standard normal.

(iv) The critical region is $|Z| > 1.96$.

(v) We compute the value of Z by substituting the sample values as

$$z = \frac{8.3 - 6.5}{\sqrt{\dfrac{(8.3)^2}{2(1275)} + \dfrac{(6.5)^2}{2(2346)}}} = \frac{1.8}{\sqrt{0.0270 + 0.0090}} = \frac{1.8}{0.19} = 9.47.$$

(vi) Conclusion. Since the calculated value $z = 9.47$ falls in the critical region, so we reject the hypothesis $H_0: \sigma_1 = \sigma_2$ and conclude that the difference between the standard deviations is significant.
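Both large-sample tests on standard deviations reduce to one-line formulas, which may be sketched as follows. The function names `sd_z` and `sd_difference_z` are illustrative assumptions; the second function substitutes the sample standard deviations for the unknown population values. Without intermediate rounding the second statistic comes to about 9.48, against 9.47 when the denominator is rounded to 0.19.

```python
import math

def sd_z(s, sigma0, n):
    """z statistic for H0: sigma = sigma0 from a large normal sample."""
    return (s - sigma0) / (sigma0 / math.sqrt(2 * n))

def sd_difference_z(s1, n1, s2, n2):
    """z statistic for H0: sigma1 = sigma2, sample s.d.'s used in the standard error."""
    standard_error = math.sqrt(s1 ** 2 / (2 * n1) + s2 ** 2 / (2 * n2))
    return (s1 - s2) / standard_error

z_one = sd_z(8.9, 7.5, 100)                    # S = 8.9, sigma0 = 7.5, n = 100
z_two = sd_difference_z(8.3, 1275, 6.5, 2346)  # centres A and B

print(round(z_one, 2), round(z_two, 2))
```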

16.2.8. Relationship Between Confidence Interval and Tests of Hypothesis. There is a close relationship between the confidence interval for a parameter $\theta$ and a test of hypothesis about $\theta$. Let [L, U] be a $100(1-\alpha)\%$ confidence interval for the parameter $\theta$. Then we will accept the null hypothesis $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ at a level of significance $\alpha$ if $\theta_0$ falls inside the confidence interval, but if $\theta_0$ falls outside the interval [L, U], we will reject $H_0$. In the language of hypothesis testing, the $(1-\alpha)100\%$ confidence interval is known as the acceptance region and the region outside the confidence interval is called the rejection or critical region. The critical values are the end points of the confidence interval.
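The equivalence between the two-sided test and the confidence interval can be illustrated numerically for the mean of a normal population with known $\sigma$. The sketch below is only an illustration: the function name `duality_check` and the data (sample mean 52, $\sigma$ = 10, n = 25) are hypothetical and do not come from the text.

```python
import math
from statistics import NormalDist

def duality_check(xbar, sigma, n, mu0, alpha=0.05):
    """Return the interval, z, and the accept decisions of the test and of the interval."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    standard_error = sigma / math.sqrt(n)
    lower = xbar - z_crit * standard_error
    upper = xbar + z_crit * standard_error
    z = (xbar - mu0) / standard_error
    accept_by_test = abs(z) <= z_crit                 # z outside the critical region
    accept_by_interval = lower <= mu0 <= upper        # mu0 inside [L, U]
    return (lower, upper), z, accept_by_test, accept_by_interval

# Hypothetical data: the two decisions agree for every hypothesised mean tried.
for mu0 in (47.0, 50.0, 53.0, 57.0):
    interval, z, by_test, by_interval = duality_check(52.0, 10.0, 25, mu0)
    print(mu0, round(z, 2), by_test, by_interval)
```

Here the 95% interval is roughly 48.08 to 55.92, so the hypothesised values 50 and 53 are accepted and 47 and 57 are rejected by either route.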


successes and failures, occurrence and non-occurrence, defective and non-defective, and so forth. The units of a sample drawn from such a population will fall into either one class or the other of the two classes. If the proportion of one class (say, successes) is p, then the proportion of the other class is $q = 1 - p$. Suppose we wish to test the hypothesis that the proportion of successes in a binomial population is equal to a specified value, i.e. $H_0: p = p_0$, where $p_0$ is the specified value of p, the parameter of the binomial distribution.


The procedure for testing $H_0: p = p_0$ when sample size is small, is as follows:

(i) Formulate the null hypothesis as $H_0: p = p_0$, with an appropriate alternative hypothesis about p.

(ii) Take the significance level at $\alpha = 0.05$ or 0.01. It may be impossible to set $\alpha$ at exactly 0.05 or 0.01 as the test-statistic has a discrete distribution.

(iii) The test-statistic is X, the number of successes in n trials, i.e. the test statistic is the binomial random variable X.

(iv) The rejection region will consist of all values of X whose probabilities (areas) are equal to or less than the significance level $\alpha$. In case of one-tailed test, the probabilities (areas) in the desired tail are added till we reach the significance level $\alpha$. In case of two-tailed test, the probabilities are added from both the tails in such a way that the sum is equal to or less than $\alpha$ and half the sum comes from each tail.

(v) Find x, the number of successes. To make the computations easier, the probabilities are shown either graphically or by cumulative or decumulative columns.

(vi) Decide as below:
Reject the hypothesis $H_0$, if x falls in the critical region.
Accept $H_0$, otherwise.


Example 16.20. A coin is tossed 8 times and comes up heads 7 times. Can we conclude that the coin is fair at a significance level of 0.05?

(i) Let p denote the probability of heads in a single toss of the coin. Then our null hypothesis that the coin is fair, will be formulated as
$H_0: p = 0.5$
and the alternative hypothesis would be $H_1: p \ne 0.5$.

(ii) The significance level is approximately 0.05.

(iii) The test-statistic to be used is X, the number of heads.

(iv) Critical Region. First we compute the probabilities associated with X, the number of heads, by using the binomial distribution $P(X = x) = \binom{n}{x} p^x q^{n-x}$. Under $H_0: p = \tfrac{1}{2}$,

$$P(X = x) = \binom{8}{x} \left(\tfrac{1}{2}\right)^x \left(\tfrac{1}{2}\right)^{8-x} = \binom{8}{x} \left(\tfrac{1}{2}\right)^8.$$

The probabilities of 0, 1, 2, ..., 7, and 8 heads are given below:

x    Probability P(X=x)    Cumulative    Decumulative
0    0.0039                0.0039        1.0000
1    0.0312                0.0351        0.9961
2    0.1094                0.1445        0.9649
3    0.2188                0.3633        0.8555
4    0.2734                0.6367        0.6367
5    0.2188                0.8555        0.3633
6    0.1094                0.9649        0.1445
7    0.0312                0.9961        0.0351
8    0.0039                1.0000        0.0039

We use the cumulative column and the decumulative column as the critical region is composed of two portions of area, one in each tail of the distribution. If $\alpha = 0.05$, then $\alpha/2 = 0.025$ (area in each tail). We observe that
$P(X \le 1) = 0.0351 > 0.025$, and
$P(X \ge 7) = 0.0351 > 0.025$.
Therefore the true significance level is
$\alpha = P(X \le 0) + P(X \ge 8) = 0.0039 + 0.0039 = 0.0078$.
Hence the critical region is $X \le 0$ and $X \ge 8$.

(v) Computation: x = 7.

(vi) Decision. Since x = 7 does not fall in the critical region, so we accept our null hypothesis $H_0: p = 0.5$, and conclude that the coin is fair.
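The construction of the critical region for a small-sample binomial test can be sketched in code. The function name `two_tailed_binomial_region` is an illustrative assumption; it adds probabilities inward from each tail for as long as that tail does not exceed $\alpha/2$, and returns the region together with the true significance level actually attained.

```python
from math import comb

def two_tailed_binomial_region(n, p0, alpha):
    """Critical region and true size of a two-tailed exact binomial test of H0: p = p0."""
    pmf = [comb(n, x) * p0 ** x * (1 - p0) ** (n - x) for x in range(n + 1)]
    half = alpha / 2

    lower, lower_area = [], 0.0
    for x in range(n + 1):                 # add probabilities from the lower tail
        if lower_area + pmf[x] > half:
            break
        lower_area += pmf[x]
        lower.append(x)

    upper, upper_area = [], 0.0
    for x in range(n, -1, -1):             # add probabilities from the upper tail
        if upper_area + pmf[x] > half:
            break
        upper_area += pmf[x]
        upper.append(x)

    return sorted(lower + upper), lower_area + upper_area

region, true_alpha = two_tailed_binomial_region(8, 0.5, 0.05)
observed = 7
print(region, round(true_alpha, 4), observed in region)
```

For 8 tosses of a fair coin the region consists of 0 and 8 heads only, so an observed count of 7 leads to acceptance of the hypothesis.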
EXERCISES 
16.1 Define the following concepts in your own words as fully as you can:
(i) Hypothesis Testing (ii) Statistical Hypothesis
(iii) Null Hypothesis (iv) Critical Region
(v) Level of Significance
(P.U., B.A/B.Sc. 1980)

16.2 Explain with examples the difference between
(i) Null Hypothesis and Alternative Hypothesis.
(ii) Simple Hypothesis and Composite Hypothesis.
(iii) Acceptance Region and Rejection Region.
(iv) Type I-Error and Type II-Error.
(v) One-tailed Test and Two-tailed Test.
(P.U., B.A/B.Sc. 1979, 83, 91, 93)

16.3 Explain what is meant by (i) a statistical hypothesis, (ii) test-statistic, (iii) the power of a test, (iv) significance level, (v) test of significance, and (vi) operating characteristic function.

16.4 (a) Distinguish between any two of the following concepts:
(i) Statistical Estimation and Hypothesis Testing.
(ii) Type I and Type II Errors.
(iii) Rejection and Non-rejection Regions.
(iv) A Test at $\alpha$ level of significance and $1-\alpha$ Confidence Interval.

(b) How is the Type I-error related to the Type II error? Are type I and type II errors such that $\alpha + \beta = 1$? (P.U., B.A/B.Sc. 1986-8)


16.5 (a) A court may acquit a guilty person or convict an innocent person. An examiner may pass an undeserving student or may fail a deserving student. Discuss the relevance of the concept of the two types of errors in these two cases. Give other examples.

(b) The proportion of families buying milk from company A in a city is believed to be p = 0.6. If a random sample of 10 families shows that 3 or less buy milk from company A, we shall reject the hypothesis that p = 0.6 in favour of the alternative p < 0.6. Evaluate $\alpha$ if p = 0.6. Evaluate $\beta$ for the alternatives p = 0.3, p = 0.4 and p = 0.5. (P.U., B.A/B.Sc. 1993)

16.6 (a) Define Type-I and Type-II errors in testing hypotheses. A normal distribution is known to have a variance of 2.8. A one-tailed (increase) test is proposed of the form $H_0: \mu \le 14$ versus $H_1: \mu > 14$. Find the probability of making a Type-II error ($\beta$) with a sample size [...] (i) 0.05, (ii) 0.01, when [...]

(b) Given $H_0: \mu \ge 200$, $H_1: \mu < 200$, n = 100, $\alpha = 0.023$, and $\sigma = 25$.
(i) For what values of the sample mean $\bar{x}$ will $H_0$ be accepted?
(ii) Compute $\beta$ if $\mu$ is actually 191.
(iii) What is the power of the test in (ii)? What does it mean?

16.7 (a) Explain how the null hypothesis and the alternative hypothesis are formulated.

(b) An exercise physiologist wants to demonstrate that the average person walks more than 800 km per year. State the null and alternative hypotheses. What do we use as test-statistic?

(c) Describe the general procedure for testing a hypothesis about a population parameter.

16.8 Based on a sample of 25 observations from a normal population with $\sigma = 3$, the hypothesis $H_0: \mu = 67$ against $H_1: \mu > 67$ is tested at 5% level of significance. Compute the probabilities of committing type-II errors, $\beta$, and the powers of the test, when alternative hypotheses of 68.5, 68.0, 67.5 and 66 are used. (P.U., B.A/B.Sc. 1993)


16.9 Given $H_0: \mu = \mu_0$ and $H_1: \mu = \mu_1$, and $\alpha$ and $\beta$ are probabilities of Type I and Type II errors respectively, show that for a one-sided hypothesis test, the required sample size n is given by the expression

$$n = \frac{(z_\alpha + z_\beta)^2 \sigma^2}{(\mu_1 - \mu_0)^2}.$$

Also use this formula to find n when $\sigma = 12$, $\mu_0 = 28$, $\mu_1 = 32$, $\alpha = 0.05$ and $\beta = 0.01$.

16.10 The hypothesis $H_0: \mu = 100$ is to be tested with $\alpha = 0.05$. The population standard deviation is known to be $\sigma = 10$.

(a) Would a sample of size n = 100 result in a value of $\beta$ less than 0.2 if, in fact, $\mu = 110$?

(b) How large a sample would be required so that $\beta = 0.01$ if, in fact, $\mu = 110$? (P.U., M.Sc., 1972)

16.11 Suppose that $H_0: \mu = 200$ miles and $H_1: \mu > 200$ miles. An $\alpha = 0.05$ is required and $\beta = 0.10$ is acceptable when the true mean is 205 miles. Find the optimum sample size. It is estimated that $\sigma = 15$. What decision rule would you establish?

16.12 (a) What statistical hypotheses can be tested by means of the normal distribution?

(b) Past experience has shown that the scores of students who take a certain mathematics test are normally distributed with mean 75 and variance 36. The Mathematics Department members would like to know whether this year's group of 16 students is typical. They decide to test the hypothesis that this year's students are typical against the alternative that they are not typical. When the students take the test, the average score is 82. What conclusion should be drawn?

16.13 (a) What is the difference between a one-sided test and a two-sided test? When should each be used?

(b) A random sample of size 36 is taken from a normal population with a known variance $\sigma^2 = 25$. If the mean of the sample is $\bar{x} = 42.6$, test the null hypothesis $\mu = 45$ against the alternative hypothesis $\mu < 45$ with $\alpha = 0.05$ ($\alpha$ is the probability of committing Type I-error). (P.U., B.A/B.Sc. 1978)


16.14 (a) The heights of college male students are known to be normally distributed with a mean of 67.39 inches and $\sigma = 1.30$ inches. A random sample of 200 students showed a mean height of 67.47 inches. Using a 0.05 significance level, test the hypothesis $H_0: \mu = 67.39$ against the alternative $H_1: \mu > 67.39$.

(b) The IQ's of the college students are known to be normally distributed with a mean of 123. A random sample of 49 students showed an average IQ of $\bar{x} = 120.67$ and $S = 8.44$. Test the hypothesis that $\mu = 123$ against the alternative that it is less. Let $\alpha = 0.05$.

16.15 (a) A sample of size 40 from a non-normal population yielded the sample mean $\bar{x} = 71$ and $S^2 = 200$. Test $H_0: \mu = 72$ against $H_1: \mu \ne 72$, using a 0.01 significance level.

(b) Suppose that the mean $\mu$ of a random variable X is unknown but the variance for X is known to be 144. Should we reject the null hypothesis $H_0: \mu = 15$ in favour of an alternative hypothesis $H_1: \mu \ne 15$ at $\alpha = 0.05$, if a random sample of 64 observations yields a mean $\bar{x} = 12$? (P.U., B.A/B.Sc. 1985)

16.16 It is claimed that an automobile is driven on the average more than 20,000 kilometers per year. To test this claim, a random sample of 100 automobile owners are asked to keep a record of the kilometers they travel. Would you agree with the claim if the random sample showed an average of 23,500 kilometers and a standard deviation of 3900 kilometers? Use a 0.01 level of significance. (P.U., B.A/B.Sc. 1979)

16.17 (a) A sample of 900 members has a mean 2.4 inches. Could it be reasonably regarded as being a simple random sample from a large population whose mean is 2.9 inches and standard deviation 3.2 inches?

(b) A sample of size 400 has $\bar{x} = 6.0''$. Can it be regarded as a simple random sample from a large population with mean $6.2''$ and standard deviation $2.25''$?

16.18 (a) A process is in control when the average amount of instant coffee that is packed in a jar, is 6 oz. The standard deviation is 0.2 oz. A sample of 100 jars is selected at random and the sample average is found to be 6.1 oz. Is the process out of control?

(b) Can you reject a claim that the average age of members of Parliament is at least 50, if a random sample of 36 members has a mean age of 48.7 with a standard deviation of 3.1 years? Assume members' ages are normally distributed; test at the 0.01 level.

16.19 (a) A sample of size 6 from a normal population with variance 24 gave $\bar{x}_1 = 15$. A sample of size 8 from a normal population with variance 80 gave $\bar{x}_2 = 13$. Test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \ne 0$. Let $\alpha = 0.05$.

(b) A random sample of size $n_1 = 25$, taken from a normal population with a standard deviation $\sigma_1 = 5.2$, has a mean $\bar{x}_1 = 81$. A second random sample of size $n_2 = 36$, taken from a different normal population with a standard deviation $\sigma_2 = 3.4$, has a mean $\bar{x}_2 = 76$. Test the hypothesis at the 0.06 level of significance that $\mu_1 = \mu_2$ against the alternative $\mu_1 \ne \mu_2$.


16.20 The two samples A and B detailed below were taken from normal populations of standard deviation 2.5. Decide whether the difference of sample means is significant at the 0.05 level of significance.

16.21 An examination was given to two classes of 40 and 50 students respectively. In the first class, mean grade was 74 with standard deviation of 8, while in the second class the mean grade was 78 with a standard deviation of 7. Is there a significant difference between mean grades (i) at 5% level of significance? (ii) at 1% level of significance? (P.U., B.A/B.Sc. 1992)

16.22 A manufacturer suspects a difference in the quality of the spare parts he receives from two suppliers. He obtains the following data on the service life of random samples of parts from two suppliers.

Supplier    Number in Sample    Mean    Standard Deviation
            50                  150
            100                 153

Test whether the difference between the two sample means is statistically significant at the 1% level of significance.
16.23 (a) A simple sample of heights of 6400 Englishmen has a mean of 67.85 inches and a standard deviation of 2.56 inches, while a simple sample of heights of 1600 Australians has a mean of 68.55 inches and a standard deviation of 2.52 inches. Do the data indicate that Australians are on the average taller than Englishmen? Use $\alpha = 0.05$. (P.U., B.A/B.Sc. 1996)

(b) A potential buyer of light bulbs bought 50 bulbs of each of 2 brands. Upon testing the bulbs, he found that brand A had a mean life of 1282 hours with a standard deviation of 80 hours, whereas brand B had a mean life of 1208 hours with a standard deviation of 94 hours. Can the buyer be quite certain that the two brands do differ in quality?

16.24 A random sample of 80 light bulbs manufactured by company A had an average life time of 1258 hours with a standard deviation of 94 hours, while a random sample of 60 light bulbs manufactured by company B had an average lifetime of 1029 hours with a standard deviation of 68 hours. Because of the high cost of bulbs from company A, we are inclined to buy from company B unless the bulbs from company A will last over 200 hours longer on the average than those from company B. Run a test using $\alpha = 0.01$ to determine from whom we should buy our bulbs. (P.U., B.A/B.Sc. 1991)

16.25 (a) Explain how you test the hypotheses on proportions.

(b) A basketball player has hit on 60% of his shots from the floor. If on the next 100 shots he makes 70 baskets, would you say that his shooting has improved? Use a 0.05 level of significance. (P.U., B.A/B.Sc. 1978, 81)

16.26 (a) A coin is tossed 900 times and heads appear 490 times. Does this result support the hypothesis that the coin is unbiased?

(b) The sex distribution of 98 births reported in a newspaper was 52 boys and 46 girls. Is this consistent with an equal sex division in the population?

16.27 In a poll of 10,000 voters selected at random from all the voters in a certain district, it is found that 5,180 voters are in favour of a particular candidate. Test the null hypothesis that the proportion of all the voters in the district, who favour the candidate, is equal to or less than 50% against the alternative that it is greater than 50%. Use a 0.05 level of significance. (P.U., B.A/B.Sc. 1996)

16.28 (a) In the inspection of a product in a sample of 200 units, 12 are defective. Is this consistent with an average of [...] per cent set as a standard?

(b) A sample of size 78 from a binomial population gave 35 successes. Test the null hypothesis that the true proportion of successes is 0.55 against the alternative that it is less. Let $\alpha$ = [...]

16.29 (a) The manufacturer of a patent medicine claimed that it was 90% effective in relieving an allergy for a period of 8 hours. In a sample of 200 people who had the allergy, the medicine provided relief for 160 people. Determine whether the manufacturer's claim is legitimate at the $\alpha = 0.01$ level.

(b) It is claimed that 90% of men cannot tell the difference between two different brands of cheese, but of the members of a random sample of 500 men, 72 could distinguish between them. Is the claim justified? (I.U., M.Sc. 1995)

16.30 An electrical company claimed that at least 95% of the parts which they supplied on a government contract conformed to specifications. A sample of 409 parts was tested, and 5 did not meet specifications. Can we accept the company's claim at a 0.05 level of significance?

16.31 (a) A random sample of 150 light bulbs manufactured by a firm X showed 12 defective bulbs while a random sample of 100 light bulbs manufactured by another firm Y showed 4 defective bulbs. Is there a significant difference between the proportions of defective bulbs? (P.U., B.A/B.Sc. [...])

(b) Random samples of 500 men and 500 women are selected to determine whether the proportions of men and women favouring a political candidate are different. Perform a hypothesis test at 5 percent level if, in the samples, 225 men and 275 women favour the candidate. What is implied by the test result?

16.32 (a) A machine puts out 16 imperfect articles in a sample of 500. After machine is overhauled, it puts out 3 imperfect articles in a batch of 100. Has the machine been improved?

(b) A manufacturer of house-dresses sent out advertising by mail. He sent samples of material to each of 2 groups of 1,000 women. For one group, he enclosed a white return envelope and for the other group, a blue envelope. He received orders from 10% and 13% respectively. Do the data indicate that the colour of the envelope has an effect on the sales? Use 5% level of significance. (P.U., B.A/B.Sc. 1980)




16.33 A civil service examination is given to a group of 200 candidates. On the basis of their total scores, the 200 candidates are divided into two groups, the upper 30 per cent and the remaining 70 per cent. Consider the first question on the examination. Among the first group, 40 had the correct answer, whereas among the second group, 80 had the correct answer. On the basis of these results, can one conclude that the first question is no good at discriminating ability of the type being examined here?


Hint. Here $\hat{p}_1 = \frac{40}{60}$, $\hat{p}_2 = \frac{80}{140}$ and $\hat{p}_c = \frac{40 + 80}{60 + 140} = 0.6$, so that

$$z = \frac{0.67 - 0.57}{\sqrt{(0.6)(0.4)\left(\frac{1}{60} + \frac{1}{140}\right)}} = 1.32 < z_{0.95}.$$

Accept the hypothesis and conclude that the first question is not satisfactory, etc.


The standard deviation of a simple sample of 1,000 members is 5.9 years and that of an independent sample of 900 members is 6.1 years. Show that the samples can be reasonably regarded as drawn from equally variable normal populations. (I.U., M.Sc., 1993)


A coin is tossed 10 times and comes up heads 8 times. Can we conclude that the coin is fair at a significance level of 0.05?




17 


The Chi-Square Distribution 
and Statistical Inference 


17.1 INTRODUCTION 


Another distribution that has many important applications in statistical inference is the $\chi^2$-distribution ($\chi$ is the Greek letter chi, pronounced ki as in kite). The chi-square distribution was first obtained in 1875 by F.R. Helmert (1843-1917), a German geodesist. Later in 1900, Karl Pearson (1857-1936) showed that as n approaches infinity, a discrete multinomial distribution may be transformed and made to approach a chi-square distribution. This approximation has broad applications such as a test of goodness of fit, a test of independence and a test of homogeneity. The chi-square distribution contains only one parameter, called the number of degrees of freedom (df), where the term degrees of freedom represents the number of independent random variables that express the chi-square. If the random variables entering a chi-square are subjected to linear restrictions, then the number of degrees of freedom is reduced by the number of restrictions involved. Later we shall find that the number of degrees of freedom is given as the total number of observations in a sample minus the number of population parameters that must be estimated from the sample data.


17.2 THE CHI-SQUARE ($\chi^2$) DISTRIBUTION

Let $Z_1, Z_2, \ldots, Z_n$ be normally and independently distributed random variables with zero means and unit variances. Then a random variable expressed by the quantity

$$\chi^2 = \sum_{i=1}^{n} Z_i^2$$

is defined as a chi-square random variable with n degrees of freedom. That is, a $\chi^2$ random variable is defined as the sum of squares of n independent standard normal random variables. Its density function has the following form:

$$f(\chi^2) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,(\chi^2)^{(n/2)-1}\,e^{-\chi^2/2}, \qquad 0 < \chi^2 < \infty.$$

Random variables having the above density function are said to have a chi-square distribution with n degrees of freedom and are denoted by $\chi^2_{(n)}$, where the parameter n, called the degrees of freedom, is a positive integer.

To obtain the distribution of $\chi^2 = \sum_{i=1}^{n} Z_i^2$, we use the moment generating function technique. The m.g.f. of the chi-square defined above is

$$M_0(t) = E(e^{t\chi^2}) = E\left[e^{t\sum Z_i^2}\right] = E\left[e^{tZ_1^2}\, e^{tZ_2^2} \cdots e^{tZ_n^2}\right] = \prod_{i=1}^{n} E\left[e^{tZ_i^2}\right],$$

as the $Z_i$ are all independent.

$$\text{Now}\quad E\left[e^{tZ_i^2}\right] = \int_{-\infty}^{\infty} e^{tz^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(1-2t)z^2/2}\,dz = \frac{1}{\sqrt{1-2t}}\cdot\frac{\sqrt{1-2t}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(1-2t)z^2/2}\,dz.$$

But for $t < \frac{1}{2}$, the integral on the right (together with its constant) is equal to 1 as it represents the total probability under a normal curve with mean zero and variance $\frac{1}{1-2t}$. Therefore,

$$E\left[e^{tZ_i^2}\right] = (1-2t)^{-1/2}, \quad \text{for } t < \tfrac{1}{2}.$$

$$\text{Hence}\quad M_0(t) = \prod_{i=1}^{n} (1-2t)^{-1/2} = (1-2t)^{-n/2}, \quad \text{for } t < \tfrac{1}{2}.$$

We know that the m.g.f. of the Gamma (Pearson Type III) distribution,

$$f(x) = \frac{a^p}{\Gamma(p)}\,x^{p-1}\,e^{-ax}, \quad 0 < x < \infty, \qquad \text{is}\quad M_0(t) = \left(1 - \frac{t}{a}\right)^{-p}.$$

Comparing, we find that $p = \frac{n}{2}$ and $a = \frac{1}{2}$. Thus the p.d.f. of $\chi^2$ is

$$f(\chi^2) = \frac{(1/2)^{n/2}}{\Gamma(n/2)}\,(\chi^2)^{(n/2)-1}\,e^{-\chi^2/2} = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,(\chi^2)^{(n/2)-1}\,e^{-\chi^2/2}, \qquad 0 < \chi^2 < \infty.$$

If n = 1, then $\chi^2 = Z^2$ so that the square of a standard normal random variable is distributed as $\chi^2$ with one degree of freedom. Moreover, the quantity $\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$ is distributed as $\chi^2$ with one degree of freedom as $\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$.

17.2.1. Properties of the Chi-Square Distribution. The chi-square distribution has the following properties:

(i) The chi-square is a continuous distribution ranging from zero to plus infinity, i.e. $0 < \chi^2 < \infty$.

(ii) The mean of a chi-square distribution is equal to the number of degrees of freedom and its variance is equal to twice the number of degrees of freedom. That is, $E[\chi^2_{(n)}] = n$ and $Var[\chi^2_{(n)}] = 2n$.

(iii) The moments of the $\chi^2$-distribution about the origin are found as below:

The m.g.f. is given by

$$M_0(t) = \int_0^{\infty} e^{t\chi^2}\,\frac{1}{2^{n/2}\,\Gamma(n/2)}\,(\chi^2)^{(n/2)-1}\,e^{-\chi^2/2}\,d(\chi^2) = (1-2t)^{-n/2}, \quad \text{for } t < \tfrac{1}{2}.$$

Expanding $M_0(t)$ in power series, we get

$$M_0(t) = 1 + \frac{n}{2}(2t) + \frac{n}{2}\left(\frac{n}{2}+1\right)\frac{(2t)^2}{2!} + \cdots + \frac{n}{2}\left(\frac{n}{2}+1\right)\cdots\left(\frac{n}{2}+r-1\right)\frac{(2t)^r}{r!} + \cdots$$

Thus $\mu'_r$ = coefficient of $\frac{t^r}{r!}$ in the expansion of $M_0(t)$

$$= 2^r\,\frac{n}{2}\left(\frac{n}{2}+1\right)\cdots\left(\frac{n}{2}+r-1\right) = \frac{2^r\,\Gamma\left(\frac{n}{2}+r\right)}{\Gamma\left(\frac{n}{2}\right)}.$$

Putting r = 1, 2, 3 and 4, we obtain

$$\mu'_1 = n, \quad \mu'_2 = n(n+2), \quad \mu'_3 = n(n+2)(n+4), \quad \mu'_4 = n(n+2)(n+4)(n+6).$$

Thus $E(\chi^2) = \mu'_1 = n$, $Var(\chi^2) = \mu'_2 - (\mu'_1)^2 = 2n$, $\mu_3 = 8n$ and $\mu_4 = 12n^2 + 48n$.

Furthermore, $\beta_1 = \frac{8}{n}$ and $\beta_2 = 3 + \frac{12}{n}$. The cumulants of the distribution also exist for all orders.
(iv) The curve of a $\chi^2$-distribution is positively skewed. The skewness decreases as n increases. For example, when n = 1,

$$f(\chi^2) = \frac{1}{\sqrt{2\pi}}\,(\chi^2)^{-1/2}\,e^{-\chi^2/2},$$

the curve is extremely J-shaped and the skewness is the highest.

When n = 2, $f(\chi^2) = \frac{1}{2}\,e^{-\chi^2/2}$.

The curve becomes a steadily decreasing exponential curve with initial point $\left(0, \frac{1}{2}\right)$.

When n = 3, $f(\chi^2) = \frac{1}{\sqrt{2\pi}}\,(\chi^2)^{1/2}\,e^{-\chi^2/2}$.

The curve originates at the point (0, 0), rises to a maximum at $\chi^2 = n - 2$, i.e. 1, then decreases with the X-axis as asymptote, and the curve is unimodal. For n > 3, the distribution takes a curve similar to that for n = 3. The mode of a chi-square distribution is thus equal to n - 2. As n increases, the distribution becomes more and more symmetric.

[Figure: $\chi^2$-distribution for various values of n; horizontal axis $\chi^2$, vertical axis $f(\chi^2)$]


(v) The $\chi^2$-distribution tends to the normal distribution as the number of degrees of freedom approaches infinity.

The m.g.f. of $\chi^2_{(n)}$ is $M_0(t) = (1 - 2t)^{-n/2}$.

Let us consider the chi-square standard variable $\frac{\chi^2 - n}{\sqrt{2n}}$.

Then its m.g.f. about mean is

$$M(t) = e^{-nt/\sqrt{2n}}\,M_0\left(\frac{t}{\sqrt{2n}}\right) = e^{-t\sqrt{n/2}}\left(1 - t\sqrt{\frac{2}{n}}\right)^{-n/2}.$$

Taking natural logs, we have

$$\ln M(t) = -t\sqrt{\frac{n}{2}} - \frac{n}{2}\ln\left(1 - t\sqrt{\frac{2}{n}}\right) = -t\sqrt{\frac{n}{2}} + \frac{n}{2}\left[t\sqrt{\frac{2}{n}} + \frac{t^2}{2}\cdot\frac{2}{n} + \frac{t^3}{3}\left(\frac{2}{n}\right)^{3/2} + \cdots\right]$$

$$= \frac{t^2}{2} + \text{higher powers of } \frac{1}{\sqrt{n}}.$$

Thus as $n \to \infty$, $\ln M(t) \to \frac{t^2}{2}$ so that $M(t) = e^{t^2/2}$, which is the m.g.f. of the standard normal random variable. Hence the random variable $\frac{\chi^2 - n}{\sqrt{2n}}$ tends to the standard normal distribution and consequently the $\chi^2$-distribution tends to normality as n approaches infinity.

(vi) An important approximation to the $\chi^2$-distribution was given by R.A. Fisher (1890-1962), who showed that for sufficiently large n, the random variable $\sqrt{2\chi^2}$ is approximately normally distributed with mean $\sqrt{2n-1}$ and unit variance. In 1931, Wilson and Hilferty gave a better approximation by showing that the random variable $\left(\frac{\chi^2}{n}\right)^{1/3}$ is approximately normally distributed with mean $1 - \frac{2}{9n}$ and variance $\frac{2}{9n}$.
(vii) Additive Property. If X and Y are independent $\chi^2$ random variables with $n_1$ and $n_2$ degrees of freedom respectively, then the sum X + Y is a $\chi^2$ random variable with $n_1 + n_2$ degrees of freedom.

The m.g.f. of X + Y = [m.g.f. of $\chi^2_{(n_1)}$] [m.g.f. of $\chi^2_{(n_2)}$]

$$= (1-2t)^{-n_1/2}\,(1-2t)^{-n_2/2} = (1-2t)^{-(n_1+n_2)/2}.$$

But this is the m.g.f. of a $\chi^2$ random variable with $n_1 + n_2$ degrees of freedom. Thus the sum of two independent chi-squares is itself a chi-square. This property can be extended to any number of independent $\chi^2$ random variables.
(viii) Partitioning Property. A $\chi^2$ random variable can be partitioned into two or more than two parts which are also $\chi^2$ random variables, and the sum of their degrees of freedom equals the total degrees of freedom.

Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population $N(\mu, \sigma^2)$. Then the quantity $\sum (X_i - \mu)^2$ may be expressed by the following simple identity:

$$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2.$$

Dividing all the terms in the above identity by $\sigma^2$, we get

$$\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 = \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + n\left(\frac{\bar{X} - \mu}{\sigma}\right)^2.$$

It is obvious that $\sum \left(\frac{X_i - \mu}{\sigma}\right)^2$ is distributed as $\chi^2_{(n)}$, being the sum of squares of n independent standard normal random variables.

The term $n\left(\frac{\bar{X} - \mu}{\sigma}\right)^2$ may be written as $\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$, which is a $\chi^2_{(1)}$ random variable, as the square of a standard normal random variable has a $\chi^2$ distribution with one degree of freedom.

Furthermore, it has also been shown that $\sum \left(\frac{X_i - \bar{X}}{\sigma}\right)^2 = \frac{nS^2}{\sigma^2}$ follows the $\chi^2$-distribution with (n - 1) degrees of freedom and that its distribution is independent of $\bar{X}$. Thus the identity partitions a $\chi^2_{(n)}$ variable into independent $\chi^2_{(n-1)}$ and $\chi^2_{(1)}$ parts.

This property forms the basis of the Analysis of Variance (Chapter 20) as this is the distribution of the sums of squares.
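
Properties (ii) and (vii) can be checked empirically. The following is a minimal simulation sketch, not part of the original text; the function names, sample size and seeds are illustrative assumptions. It builds chi-square variates directly from the definition (a sum of squared standard normal variables) and compares sample means and variances with n and 2n.

```python
import random
import statistics

def chi_square_draw(n, rng):
    """One chi-square(n) variate, built as a sum of n squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

def simulate(n, size=20000, seed=1):
    """Draw `size` chi-square(n) variates; size and seed are arbitrary choices."""
    rng = random.Random(seed)
    return [chi_square_draw(n, rng) for _ in range(size)]

sample_4 = simulate(4, seed=1)
sample_6 = simulate(6, seed=2)

# Additive property: independent chi-square(4) + chi-square(6)
# should behave like a chi-square(10) variable.
sample_sum = [x + y for x, y in zip(sample_4, sample_6)]

mean_4 = statistics.fmean(sample_4)
var_4 = statistics.pvariance(sample_4)
mean_sum = statistics.fmean(sample_sum)
var_sum = statistics.pvariance(sample_sum)

print(mean_4, var_4)      # should be close to n = 4 and 2n = 8
print(mean_sum, var_sum)  # should be close to 10 and 20
```

Because the check is a simulation, the printed values only approximate the theoretical moments; the agreement improves as the number of draws grows.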


17.2.2. The $\chi^2$-table. The areas for the $\chi^2$-distribution have been tabulated for various values of $\alpha$ and n, the degrees of freedom. Table 17.1 contains values of $\chi^2_{\alpha,(n)}$, which denotes the value for which the area to its right under the chi-square distribution with n (n = 1, 2, ..., 30) degrees of freedom is equal to $\alpha$.

Example 17.1. Compute $\chi^2_{0.05}$ for 30 and 50 degrees of freedom by (i) Fisher's approximation, (ii) Wilson-Hilferty approximation, and compare them with the tabulated values.

We can use the facts that

(i) $Z = \sqrt{2\chi^2} - \sqrt{2n - 1}$ (Fisher's approximation)

(ii) $Z = \dfrac{(\chi^2/n)^{1/3} - \left(1 - \frac{2}{9n}\right)}{\sqrt{\frac{2}{9n}}}$ (Wilson-Hilferty approximation)

are approximately normally distributed with zero mean and unit variance, to compute the values of $\chi^2_{\alpha}$ for n degrees of freedom ($n \ge 30$) from the tables of the standard normal distribution.


Table 17.1 Distribution of $\chi^2$

The entries in this table are values of $\chi^2_{\alpha,(n)}$ for which the area to their right under the chi-square distribution with n degrees of freedom is equal to $\alpha$.

 n  |  α=0.99    0.98    0.975   0.95    0.10    0.05    0.025   0.02    0.01
----+------------------------------------------------------------------------
  1 |  0.0002  0.0006  0.001   0.004    2.71    3.84    5.02    5.41    6.63
  2 |  0.020   0.040   0.051   0.103    4.61    5.99    7.38    7.82    9.21
  3 |  0.115   0.185   0.216   0.352    6.25    7.81    9.35    9.84   11.34
  4 |  0.297   0.429   0.484   0.711    7.78    9.49   11.14   11.67   13.28
  5 |  0.554   0.752   0.831   1.15     9.24   11.07   12.83   13.39   15.09
  6 |  0.872   1.13    1.24    1.64    10.64   12.59   14.45   15.03   16.81
  7 |  1.24    1.56    1.69    2.17    12.02   14.07   16.01   16.62   18.48
  8 |  1.65    2.03    2.18    2.73    13.36   15.51   17.53   18.17   20.09
  9 |  2.09    2.53    2.70    3.33    14.68   16.92   19.02   19.68   21.67
 10 |  2.56    3.06    3.25    3.94    15.99   18.31   20.48   21.16   23.21
 11 |  3.05    3.61    3.82    4.57    17.28   19.68   21.92   22.62   24.72
 12 |  3.57    4.18    4.40    5.23    18.55   21.03   23.34   24.05   26.22
 13 |  4.11    4.765   5.01    5.89    19.81   22.36   24.74   25.47   27.69
 14 |  4.66    5.37    5.63    6.57    21.06   23.68   26.12   26.87   29.14
 15 |  5.23    5.98    6.26    7.26    22.31   25.00   27.49   28.26   30.58
 16 |  5.81    6.61    6.91    7.96    23.54   26.30   28.85   29.63   32.00
 17 |  6.41    7.255   7.56    8.67    24.77   27.59   30.19   31.00   33.41
 18 |  7.02    7.91    8.23    9.39    25.99   28.87   31.53   32.35   34.81
 19 |  7.63    8.57    8.91   10.12    27.20   30.14   32.85   33.69   36.19
 20 |  8.26    9.24    9.59   10.85    28.41   31.41   34.17   35.02   37.57
 21 |  8.90    9.92   10.28   11.59    29.62   32.67   35.48   36.34   38.93
 22 |  9.54   10.60   10.98   12.34    30.81   33.92   36.78   37.66   40.29
 23 | 10.20   11.29   11.69   13.09    32.01   35.17   38.08   38.97   41.64
 24 | 10.86   11.99   12.40   13.85    33.20   36.42   39.36   40.27   42.98
 25 | 11.52   12.70   13.12   14.61    34.38   37.65   40.65   41.57   44.31
 26 | 12.20   13.41   13.84   15.38    35.56   38.89   41.92   42.86   45.64
 27 | 12.88   14.125  14.57   16.15    36.74   40.11   43.19   44.14   46.96
 28 | 13.56   14.85   15.31   16.93    37.92   41.34   44.46   45.42   48.28
 29 | 14.26   15.57   16.05   17.71    39.09   42.56   45.72   46.69   49.59
 30 | 14.95   16.31   16.79   18.49    40.26   43.77   46.98   47.96   50.89

For n > 30, the expression $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used as a normal variable with zero mean and unit variance, remembering that the probability for $\chi^2$ corresponds with that of a single tail of the normal curve.

"Table 17.1 (except columns 4 and 8) is taken from Table IV of Fisher and Yates: Statistical Tables for Biological, Agricultural and Medical Research, published by Oliver & Boyd, Ltd., Edinburgh, and reproduced by permission of the authors and publishers."


From Fisher's approximation, we get

$$\chi^2_{\alpha} = \frac{1}{2}\left[z_{\alpha} + \sqrt{2n - 1}\right]^2$$

and from Wilson-Hilferty approximation, we obtain

$$\chi^2_{\alpha} = n\left[1 - \frac{2}{9n} + z_{\alpha}\sqrt{\frac{2}{9n}}\right]^3,$$

where $z_{\alpha}$ is the value of the standard normal random variable corresponding to the level of significance $\alpha$.

Substituting $\alpha$ = 0.05, n = 30 in Fisher's approximation, we get

$$\chi^2_{0.05,(30)} = \frac{1}{2}\left[1.645 + \sqrt{2(30) - 1}\right]^2 \qquad (\because z_{0.05} = 1.645)$$

$$= \frac{1}{2}[1.645 + 7.681]^2 = \frac{1}{2}[86.974] = 43.49.$$

For n = 50 and $\alpha$ = 0.05,

$$\chi^2_{0.05,(50)} = \frac{1}{2}\left[1.645 + \sqrt{2(50) - 1}\right]^2 = \frac{1}{2}[1.645 + 9.950]^2 = \frac{1}{2}[134.44] = 67.22.$$

Using Wilson-Hilferty approximation, we get for n = 30 and $\alpha$ = 0.05,

$$\chi^2_{0.05,(30)} = 30\left[1 - \frac{2}{9(30)} + 1.645\sqrt{\frac{2}{9(30)}}\right]^3 = 30\,[1 - 0.0074 + 1.645(0.086)]^3 = 30\,[1.134]^3 = 30\,(1.458) = 43.74;$$

and for n = 50, we have

$$\chi^2_{0.05,(50)} = 50\left[1 - \frac{2}{9(50)} + 1.645\sqrt{\frac{2}{9(50)}}\right]^3 = 50\,[1 - 0.0044 + 1.645(0.067)]^3 = 50\,[1.1058]^3 = 67.6.$$

These values agree very well with the tabulated values of $\chi^2_{0.05,(30)} = 43.8$ and $\chi^2_{0.05,(50)} = 67.5$.
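
The two approximations in Example 17.1 are easy to evaluate numerically. The sketch below is an illustration only; the function names `chi2_fisher` and `chi2_wilson_hilferty` are assumed names, and the normal quantile 1.645 is taken from the example rather than computed.

```python
import math

def chi2_fisher(z_alpha, n):
    """Fisher's approximation: chi2 = (z_alpha + sqrt(2n - 1))^2 / 2."""
    return 0.5 * (z_alpha + math.sqrt(2 * n - 1)) ** 2

def chi2_wilson_hilferty(z_alpha, n):
    """Wilson-Hilferty approximation: chi2 = n * (1 - c + z_alpha * sqrt(c))^3, c = 2/(9n)."""
    c = 2.0 / (9.0 * n)
    return n * (1.0 - c + z_alpha * math.sqrt(c)) ** 3

Z_05 = 1.645  # upper 5% point of the standard normal, as used in Example 17.1

fisher_30 = chi2_fisher(Z_05, 30)
fisher_50 = chi2_fisher(Z_05, 50)
wh_30 = chi2_wilson_hilferty(Z_05, 30)
wh_50 = chi2_wilson_hilferty(Z_05, 50)

for n, f, w in ((30, fisher_30, wh_30), (50, fisher_50, wh_50)):
    print(n, round(f, 2), round(w, 2))
```

Carrying full precision through the intermediate steps, instead of the rounded values 0.086 and 0.067 used in the hand calculation, gives Wilson-Hilferty values slightly different from 43.74 and 67.6 and closer to the tabulated ones.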


17.3 CONFIDENCE INTERVAL ESTIMATE OF VARIANCE OF A NORMAL POPULATION

The confidence interval estimate of the population variance $\sigma^2$ is based on the sampling distribution of $S^2$, the sample variance, and the sampling distribution of $S^2$ is the chi-square distribution. We, therefore, use the $\chi^2$-distribution to obtain confidence interval estimates for $\sigma^2$ when we are given (i) one sample variance, (ii) several sample variances.

17.3.1. Confidence Interval Estimate of $\sigma^2$ from one Sample Variance. Let $\bar{X}$ and $S^2$ be the mean and variance of a random sample $X_1, X_2, \ldots, X_n$ of size n drawn from a normal population with variance $\sigma^2$. Then the statistic

$$\frac{nS^2}{\sigma^2} = \frac{\sum (X_i - \bar{X})^2}{\sigma^2} \quad \left[\text{or } \frac{(n-1)s^2}{\sigma^2}, \text{ where } s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}\right],$$

that is, the ratio of the sum of squared deviations from the sample mean to the population variance, has a chi-square distribution with (n - 1) degrees of freedom.

To construct a two-sided confidence interval for $\sigma^2$, we find two values of the $\chi^2$-distribution with (n - 1) degrees of freedom, say a and b, such that

$$\int_a^{\infty} f(\chi^2)\,d(\chi^2) = 1 - \frac{\alpha}{2} \quad \text{and} \quad \int_b^{\infty} f(\chi^2)\,d(\chi^2) = \frac{\alpha}{2}.$$

We then have an interval with an associated probability $1 - \alpha$, that contains the variance $\sigma^2$, as

$$P\left[a < \frac{nS^2}{\sigma^2} < b\right] = 1 - \alpha.$$

To put $\sigma^2$ inside the inequalities within the brackets, we proceed as below:

(i) We divide all terms inside the brackets by $nS^2$ and get

$$\frac{a}{nS^2} < \frac{1}{\sigma^2} < \frac{b}{nS^2}.$$

(ii) We replace each term with its reciprocal (remember, we reverse the direction of the inequality signs when we replace a term by its reciprocal) and obtain

$$\frac{nS^2}{a} > \sigma^2 > \frac{nS^2}{b},$$

which is equivalent to

$$\frac{nS^2}{b} < \sigma^2 < \frac{nS^2}{a}.$$

(iii) We substitute this result in the probability statement and get

$$P\left[\frac{nS^2}{b} < \sigma^2 < \frac{nS^2}{a}\right] = 1 - \alpha$$

$$\text{or}\quad P\left[\frac{\sum (X_i - \bar{X})^2}{b} < \sigma^2 < \frac{\sum (X_i - \bar{X})^2}{a}\right] = 1 - \alpha.$$

The values of a and b are obtained from the tables of the chi-square distribution having (n - 1) degrees of freedom, which are found as

$$a = \chi^2_{1-\alpha/2} \quad \text{and} \quad b = \chi^2_{\alpha/2}.$$

Hence for a particular sample of size n, the 100(1 - $\alpha$) percent confidence interval for $\sigma^2$ is given by

$$\left[\frac{\sum (x_i - \bar{x})^2}{\chi^2_{\alpha/2}},\ \frac{\sum (x_i - \bar{x})^2}{\chi^2_{1-\alpha/2}}\right],$$

where $a = \chi^2_{1-\alpha/2}$ and $b = \chi^2_{\alpha/2}$ are the values of a $\chi^2$-distribution having (n - 1) degrees of freedom leaving areas of $1 - \alpha/2$ and $\alpha/2$ respectively to the right. Thus a 95% confidence interval for $\sigma^2$ would be

$$\frac{(n-1)s^2}{\chi^2_{0.025}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{0.975}}.$$


We can obtain a confidence interval for $\sigma$ by taking the square roots of the end points of the interval for $\sigma^2$, but experience shows that $\sigma$ cannot be estimated with much precision from small samples.

17.3.2. Confidence Interval of $\sigma^2$ from several sample Variances. Suppose k samples are drawn from normal populations having equal variances $\sigma^2$ or from the same population with variance $\sigma^2$. Let $S_1^2, S_2^2, \ldots, S_k^2$ be the sample variances from samples of sizes $n_1, n_2, \ldots, n_k$. The pooled unbiased estimate of $\sigma^2$ is then given by

$$s^2 = \frac{n_1 S_1^2 + n_2 S_2^2 + \cdots + n_k S_k^2}{n_1 + n_2 + \cdots + n_k - k}.$$

When the population or populations are normal, the sampling distribution of

$$\frac{\sum n_i S_i^2}{\sigma^2} \quad \text{or} \quad \frac{(\sum n_i - k)\,s^2}{\sigma^2}$$

would be a chi-square distribution with $(\sum n_i - k)$ degrees of freedom.

Hence the $(1 - \alpha)100\%$ confidence interval for $\sigma^2$ is given by

$$\frac{\sum n_i S_i^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{\sum n_i S_i^2}{\chi^2_{1-\alpha/2}},$$

where $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ are the values of a chi-square distribution having $(\sum n_i - k)$ degrees of freedom.
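
The pooled interval can be sketched in a few lines of code. This example is an illustration under stated assumptions: the function name, the three sample sizes and the three sample variances are hypothetical, and the two critical values for 20 degrees of freedom are read from Table 17.1 rather than computed.

```python
def pooled_variance_interval(sizes, variances, chi2_upper, chi2_lower):
    """Pooled estimate of sigma^2 and its confidence interval.

    sizes      : sample sizes n_i
    variances  : sample variances S_i^2 (computed with divisor n_i)
    chi2_upper : chi-square value with area alpha/2 to its right
    chi2_lower : chi-square value with area 1 - alpha/2 to its right
    Both critical values must have sum(n_i) - k degrees of freedom.
    """
    k = len(sizes)
    total = sum(n * s2 for n, s2 in zip(sizes, variances))  # sum of n_i * S_i^2
    df = sum(sizes) - k
    pooled = total / df
    return df, pooled, (total / chi2_upper, total / chi2_lower)

# Hypothetical data: three samples, so df = 8 + 9 + 6 - 3 = 20.
sizes = [8, 9, 6]
variances = [4.0, 5.5, 3.2]

# 95% interval: chi2_{0.025,(20)} = 34.17 and chi2_{0.975,(20)} = 9.59 (Table 17.1).
df, pooled, (low, high) = pooled_variance_interval(sizes, variances, 34.17, 9.59)
print(df, round(pooled, 3), round(low, 2), round(high, 2))
```

Passing the critical values in as arguments keeps the sketch free of any statistical library; with such a library available, they could be computed instead of looked up.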


Example 17.2. A random sample of size n = 8 from a normal population gave the values 9, 14, 10, 12, 7, 13, 11, 12. Find the 90 per cent confidence interval for $\sigma^2$.

First we calculate the sample mean $\bar{x}$, which is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{88}{8} = 11.$$

Then $\sum (x_i - \bar{x})^2 = (9 - 11)^2 + (14 - 11)^2 + \cdots + (12 - 11)^2 = 36$.

From the table we find the values of $\chi^2$ for 7 d.f. at the $\alpha$ = 0.10 level of significance, i.e. $\chi^2_{0.05,(7)} = 14.07$ and $\chi^2_{0.95,(7)} = 2.17$.

Hence the 90 per cent confidence interval for $\sigma^2$ is

$$\frac{\sum (x_i - \bar{x})^2}{\chi^2_{0.05,(7)}} < \sigma^2 < \frac{\sum (x_i - \bar{x})^2}{\chi^2_{0.95,(7)}}$$

$$\text{or}\quad \frac{36}{14.07} < \sigma^2 < \frac{36}{2.17}$$

$$\text{or}\quad 2.56 < \sigma^2 < 16.59.$$

Thus the 90% confidence interval for $\sigma^2$ is (2.56, 16.59).


17.4 TESTS BASED ON CHI-SQUARE DISTRIBUTION

Some of the most frequently used tests of hypotheses that are based on the $\chi^2$-distribution are presented in the sections that follow.

17.4.1. Testing Hypothesis about Variance of a Normal Population. Suppose we desire to test a null hypothesis $H_0$ that the variance $\sigma^2$ of a normally distributed population has some specified value, say $\sigma_0^2$. To do this, we need to draw a random sample $X_1, X_2, \ldots, X_n$ of size n from the normal population and compute the value of the sample variance $S^2$. If the null hypothesis $H_0: \sigma^2 = \sigma_0^2$ is true, then the statistic $\chi^2 = \frac{nS^2}{\sigma_0^2}$ has a $\chi^2$-distribution with (n - 1) degrees of freedom.

The format for the test would be as below:

(i) Formulate the null and alternative hypotheses about $\sigma^2$. Three possible forms are

$H_0: \sigma^2 = \sigma_0^2$ and $H_1: \sigma^2 \ne \sigma_0^2$

$H_0: \sigma^2 \le \sigma_0^2$ and $H_1: \sigma^2 > \sigma_0^2$

$H_0: \sigma^2 \ge \sigma_0^2$ and $H_1: \sigma^2 < \sigma_0^2$

(ii) Decide on significance level $\alpha$. The commonly used values are $\alpha$ = 0.05 or $\alpha$ = 0.01.

(iii) The test-statistic to be used is

$$\chi^2 = \frac{nS^2}{\sigma_0^2} = \frac{\sum (X_i - \bar{X})^2}{\sigma_0^2} \quad \left[= \frac{(n-1)s^2}{\sigma_0^2}\right],$$

which under $H_0$ has a chi-square distribution with (n - 1) degrees of freedom.

(iv) Determine the critical region, which depends on $\alpha$ and the alternative hypothesis $H_1$.

(a) When $H_1$ is $\sigma^2 \ne \sigma_0^2$, the critical region is $\chi^2 < \chi^2_{1-\alpha/2,(n-1)}$ and $\chi^2 > \chi^2_{\alpha/2,(n-1)}$. (Two-tailed test)

(b) When $H_1$ is $\sigma^2 > \sigma_0^2$, the critical region will be entirely in the right tail and its value is $\chi^2 > \chi^2_{\alpha,(n-1)}$. (One-tailed test)

(c) When $H_1$ is $\sigma^2 < \sigma_0^2$, the critical region will be entirely in the left tail with the critical value $\chi^2_{1-\alpha,(n-1)}$. (One-tailed test)

(v) Compute the value of $\chi^2 = \frac{nS^2}{\sigma_0^2}$ from the given data.

(vi) Decide as below:

Reject $H_0$ if the calculated value of $\chi^2$ falls in the critical region, otherwise accept $H_0$.


Example 17.3. A sample of measurements of the tensile strength of a wire gave the sample variance $S^2$ = 31.5. The sample size was n = 16 observations.

(a) Test the hypothesis that the population variance is 25 against the alternative that the variance is greater than 25. Use a 0.05 level of significance.

(b) Construct the 95% confidence limits on the variance in the tensile strength of the wire. (I.U., M.Sc., 1991)

(a) (i) We have to decide between the hypotheses

$H_0: \sigma^2 = 25$, and $H_1: \sigma^2 > 25$.

(ii) The level of significance is $\alpha$ = 0.05.

(iii) The test-statistic is $\chi^2 = \frac{nS^2}{\sigma_0^2}$, which under $H_0$ has a $\chi^2$-distribution with (n - 1) degrees of freedom, assuming that the population is normal.

(iv) The critical region is $\chi^2 > \chi^2_{0.05,(15)} = 25.0$ (one-tailed test).

(v) We calculate the value of $\chi^2$ from the sample data as

$$\chi^2 = \frac{nS^2}{\sigma_0^2} = \frac{16\,(31.5)}{25} = 20.16.$$

(vi) Conclusion. Since the calculated value of $\chi^2$ falls in the acceptance region, we accept our null hypothesis, i.e. we have reasonable evidence to conclude that $\sigma^2$ = 25.

(b) The 95 per cent confidence interval for $\sigma^2$ is given by

$$\frac{nS^2}{\chi^2_{0.025,(15)}} < \sigma^2 < \frac{nS^2}{\chi^2_{0.975,(15)}}.$$

Substituting the values, we get

$$\frac{16\,(31.5)}{27.49} < \sigma^2 < \frac{16\,(31.5)}{6.26}$$

$$\text{or}\quad 18.3 < \sigma^2 < 80.5.$$

Hence the desired confidence limits on the population variance are (18.3, 80.5).
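
The arithmetic of Example 17.3 can be reproduced with a short sketch. The function names are illustrative assumptions, and the critical values 25.00, 27.49 and 6.26 are the Table 17.1 entries for 15 degrees of freedom, supplied by hand.

```python
def variance_test_statistic(n, sample_var, sigma0_sq):
    """Chi-square statistic n*S^2 / sigma_0^2 (S^2 computed with divisor n)."""
    return n * sample_var / sigma0_sq

def variance_confidence_interval(n, sample_var, chi2_upper, chi2_lower):
    """Interval n*S^2/chi2_{alpha/2} < sigma^2 < n*S^2/chi2_{1-alpha/2}."""
    total = n * sample_var
    return total / chi2_upper, total / chi2_lower

n, sample_var, sigma0_sq = 16, 31.5, 25.0

# (a) One-tailed test against H1: sigma^2 > 25 at alpha = 0.05.
stat = variance_test_statistic(n, sample_var, sigma0_sq)
critical = 25.00                     # chi2_{0.05,(15)} from Table 17.1
reject = stat > critical

# (b) 95% interval using chi2_{0.025,(15)} = 27.49 and chi2_{0.975,(15)} = 6.26.
low, high = variance_confidence_interval(n, sample_var, 27.49, 6.26)

print(round(stat, 2), reject)
print(round(low, 1), round(high, 1))
```

The statistic stays below the critical value, so the null hypothesis is not rejected, in agreement with the worked solution.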
Example 17.4. Given that the $X_i$ are normally distributed and given the sample values $\bar{X}$ = 42, S = 5 and n = 20. Test the hypothesis that $\sigma$ = 8.

(i) Our null hypothesis is $H_0: \sigma = 8$. Let the alternative hypothesis be $H_1: \sigma \ne 8$. (Two-tailed test)

(ii) We choose the significance level at $\alpha$ = 0.05.

(iii) The test-statistic is $\chi^2 = \frac{nS^2}{\sigma_0^2}$, as the $X_i$'s are normally distributed, and this statistic under the null hypothesis has a chi-square distribution with (n - 1) degrees of freedom.

(iv) The critical region is $\chi^2 > \chi^2_{0.025,(19)} = 32.85$ and $\chi^2 < \chi^2_{0.975,(19)} = 8.91$.

(v) Now we compute the value of $\chi^2$ from the given data as

$$\chi^2 = \frac{nS^2}{\sigma_0^2} = \frac{20\,(5)^2}{(8)^2} = \frac{20 \times 25}{64} = 7.81.$$

(vi) Conclusion. Since the calculated value of $\chi^2$ = 7.81 falls in the critical region, we therefore reject the hypothesis, and conclude that there is no evidence to accept the hypothesis that $\sigma$ = 8.


Example 17.5. The manager of a bottling plant is anxious to reduce the variability in net weight of fruit bottled. Over a long period the standard deviation has been 15.2 gm. A new machine is introduced and the net weights (in grams) in randomly selected bottles (all of the same nominal weight) are 987, 966, 955, 977, 981, ..., 975, 980, 95..., 972. Would you report to the manager that the new machine has a better performance? (P.U., M.Sc.)

(i) We have to decide between the hypotheses

$H_0: \sigma = 15.2$, i.e. the standard deviation is 15.2 gm.

$H_1: \sigma < 15.2$, i.e. the standard deviation has been reduced.

(ii) We choose the significance level at $\alpha$ = 0.05.

(iii) The test-statistic is

$$\chi^2 = \frac{nS^2}{\sigma_0^2} = \frac{\sum (X_i - \bar{X})^2}{\sigma_0^2},$$

which under $H_0$ has a $\chi^2$-distribution with (n - 1) degrees of freedom, assuming that the weights are normally distributed.

(iv) The critical region is $\chi^2 < \chi^2_{0.95,(n-1)}$. (One-tailed test)

(v) Computations give $\sum (X_i - \bar{X})^2 = 1110.1$, so that

$$\chi^2 = \frac{1110.1}{(15.2)^2} = \frac{1110.1}{231.04} = 4.80.$$

(vi) Conclusion. Since the calculated value of $\chi^2$ = 4.80 does not fall in the critical region, we cannot reject the null hypothesis that the standard deviation is 15.2 gm, and hence we would not report to the manager that the new machine has a better performance.


17.4.2. Testing Hypothesis about the Equality of Variances of k (k > 2) Normal Populations. Suppose we wish to test the null hypothesis

$$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 \ (= \sigma^2, \text{ say})$$

against the alternative

$H_1$: Not all the variances are equal.

For this purpose, several test procedures have been devised. The test procedure due to M.S. Bartlett is presented here. This test procedure is based on a statistic whose sampling distribution is approximately a chi-square distribution with (k - 1) degrees of freedom, when k random samples of sizes $n_1, n_2, \ldots, n_k$ ($\sum n_i = n$) are drawn from independent normal populations with variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$. Let $s_1^2, s_2^2, \ldots, s_k^2$ be the unbiased estimates of the population variances, computed from the k samples, and let these estimates be combined to give the pooled unbiased estimate of variance, if $H_0$ is true, as

$$s^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\,s_i^2}{\sum_{i=1}^{k} (n_i - 1)} = \frac{\sum (n_i - 1)\,s_i^2}{n - k}, \quad \text{where } s_i^2 = \frac{\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}{n_i - 1}.$$

Then the test statistic is given by

$$u = 2.3026\,\frac{q}{c},$$

$$\text{where}\quad q = (n - k)\log s^2 - \sum_{i=1}^{k} (n_i - 1)\log s_i^2, \quad \text{and}$$

$$c = 1 + \frac{1}{3(k-1)}\left[\sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{n - k}\right].$$

The statistic u, under $H_0$, has approximately a chi-square distribution with (k - 1) degrees of freedom. The value of q is large when the sample variances $s_i^2$ differ greatly and is equal to zero when all $s_i^2$ are equal. Therefore, we reject $H_0$ when the calculated value of u is greater than or equal to the tabulated value of $\chi^2$ at the desired level of significance for (k - 1) degrees of freedom. The rest of the procedure for testing the hypothesis is the same.

This test is generally known as Bartlett's test for homogeneity of variances.


Example 17.6. Suppose that four random samples of sizes $n_1$ = 11, $n_2$ = 9, $n_3$ = 12, $n_4$ = 15 are selected from four normal populations and gave $s_1^2$ = 392, $s_2^2$ = 427, $s_3^2$ = 620, $s_4^2$ = 667. Test the hypothesis of equal variances. (I.U., M.Sc.)

(i) The hypotheses would be stated as

$H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2$, and

$H_1$: Not all the variances are equal.

(ii) We use a level of significance of $\alpha$ = 0.05, and a one-sided test.

(iii) The test-statistic would be

$$u = 2.3026\,\frac{q}{c},$$

$$\text{where}\quad q = (n - k)\log s^2 - \sum_{i=1}^{k} (n_i - 1)\log s_i^2, \quad \text{and} \quad c = 1 + \frac{1}{3(k-1)}\left[\sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{n - k}\right].$$

The statistic u, under $H_0$, has approximately a chi-square distribution with (k - 1) degrees of freedom.

(iv) The critical region is $u \ge \chi^2_{0.05,(3)} = 7.81$.

(v) Computations for Bartlett's test:

Sample   n_i   n_i - 1   s_i^2   log s_i^2   (n_i - 1) log s_i^2
  1       11     10       392     2.5933          25.9330
  2        9      8       427     2.6304          21.0432
  3       12     11       620     2.7924          30.7164
  4       15     14       667     2.8241          39.5374
Total     47     43                              117.2300

The pooled estimate is

$$s^2 = \frac{10(392) + 8(427) + 11(620) + 14(667)}{43} = \frac{23494}{43} = 546.3721.$$

$$\text{Now}\quad q = (n - k)\log s^2 - \sum_{i=1}^{k} (n_i - 1)\log s_i^2 = (43)(\log 546.3721) - 117.2300$$

$$= (43)(2.73746) - 117.2300 = 117.71078 - 117.2300 = 0.48078, \quad \text{and}$$

$$c = 1 + \frac{1}{3(k-1)}\left[\sum \frac{1}{n_i - 1} - \frac{1}{n - k}\right] = 1 + \frac{1}{9}(0.3873 - 0.02326) = 1 + 0.0404 = 1.0404.$$

$$\text{Hence}\quad u = 2.3026\left(\frac{0.48078}{1.0404}\right) = \frac{1.10704}{1.0404} = 1.064.$$

(vi) Conclusion. Since the computed value of u does not fall in the rejection region, we therefore cannot reject $H_0$ and may conclude that the variances are homogeneous.
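
Bartlett's statistic is tedious by hand but short in code. The following sketch is illustrative and not from the original text; the function name is an assumption. It works with natural logarithms, so the factor 2.3026 (which converts common logarithms to natural ones) does not appear explicitly.

```python
import math

def bartlett_statistic(sizes, variances):
    """Bartlett's u for k samples.

    sizes     : sample sizes n_i
    variances : unbiased sample variances s_i^2 (divisor n_i - 1)
    Returns (pooled variance, u). Natural logs are used, which is
    equivalent to 2.3026 * q / c with q in common logarithms.
    """
    k = len(sizes)
    dfs = [n - 1 for n in sizes]
    df_total = sum(dfs)                                   # n - k
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_total
    q_ln = df_total * math.log(pooled) - sum(
        d * math.log(v) for d, v in zip(dfs, variances)
    )
    c = 1.0 + (sum(1.0 / d for d in dfs) - 1.0 / df_total) / (3.0 * (k - 1))
    return pooled, q_ln / c

# Data of Example 17.6.
pooled, u = bartlett_statistic([11, 9, 12, 15], [392, 427, 620, 667])
critical = 7.81                      # chi2_{0.05,(3)} from Table 17.1
print(round(pooled, 4), round(u, 3), u >= critical)
```

The result differs from the hand value only in the third decimal place, because the hand calculation uses logarithms rounded to four or five places; the conclusion is unchanged.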


17.5 KARL PEARSON'S APPROXIMATION

Karl Pearson (1857-1936) established a relationship between the discrete multinomial distribution and the chi-square distribution by transforming and making the multinomial distribution approach a $\chi^2$-distribution as n approaches infinity. This approximation is widely used to test agreement between the observed data and the expected (or hypothesized) results.

Suppose a random sample of n observations is distributed over k mutually exclusive and exhaustive classes or cells. Let $p_i$ (i = 1, 2, ..., k) be the probability that an observation falls in the ith class or cell and $n_i$ be the number of observations falling in that class such that $\sum p_i = 1$ and $\sum n_i = n$; that is, the data have the following multinomial structure:

Class or cell        1      2      ...    k      Total
Probability          p_1    p_2    ...    p_k    1
Observed frequency   n_1    n_2    ...    n_k    n

Then the probability function $f(n_1, n_2, \ldots, n_k)$ of $n_1$ belonging to the first cell or class, $n_2$ belonging to the second cell or class, ... and $n_k$ belonging to the kth cell or class, is given by

$$f(n_1, n_2, \ldots, n_k) = \frac{n!}{n_1!\,n_2! \cdots n_k!}\,p_1^{n_1}\,p_2^{n_2} \cdots p_k^{n_k}.$$


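Assuming that n and all $n_i$ (i = 1, 2, ..., k) are sufficiently large, we apply Stirling's approximation, $n! \approx \sqrt{2\pi}\,n^{n+1/2}\,e^{-n}$, to the factorials and obtain

$$f(n_1, n_2, \ldots, n_k) = \frac{\sqrt{2\pi}\,n^{n+1/2}\,e^{-n}\,p_1^{n_1}\,p_2^{n_2} \cdots p_k^{n_k}}{\prod_{i=1}^{k} \sqrt{2\pi}\,n_i^{n_i+1/2}\,e^{-n_i}}$$

$$= \frac{\left(\frac{np_1}{n_1}\right)^{n_1+1/2} \left(\frac{np_2}{n_2}\right)^{n_2+1/2} \cdots \left(\frac{np_k}{n_k}\right)^{n_k+1/2}}{(2\pi n)^{(k-1)/2}\,(p_1 p_2 \cdots p_k)^{1/2}} = \frac{\prod_{i=1}^{k} \left(\frac{np_i}{n_i}\right)^{n_i+1/2}}{(2\pi n)^{(k-1)/2}\,\prod_{i=1}^{k} p_i^{1/2}}.$$

Now $E(n_i) = np_i$, meaning that expected cell frequency = n (cell probability), and $Var(n_i) = np_i(1 - p_i)$ for i = 1, 2, ..., k.

The standard normal variable corresponding to $n_i$ is

$$Z_i = \frac{n_i - np_i}{\sqrt{np_i(1 - p_i)}} = \frac{n_i - np_i}{\sigma_i},$$

so that $np_i = n_i - Z_i\sigma_i$ and $n_i = Z_i\sigma_i + np_i$.

Substituting these transformations in $f(n_1, n_2, \ldots, n_k)$, i.e., replacing $n_i$ by its equivalent, we get

$$f(n_1, n_2, \ldots, n_k) = \frac{\prod_{i=1}^{k} \left(1 + \frac{Z_i\sigma_i}{np_i}\right)^{-(Z_i\sigma_i + np_i + 1/2)}}{(2\pi n)^{(k-1)/2}\,(p_1 p_2 \cdots p_k)^{1/2}} = \frac{M}{C}, \text{ say.}$$

Taking natural logarithm (ln) of M, we get

$$\ln M = -\sum_{i=1}^{k} \left(Z_i\sigma_i + np_i + \tfrac{1}{2}\right) \ln\left(1 + \frac{Z_i\sigma_i}{np_i}\right)$$

$$= -\sum_{i=1}^{k} \left(Z_i\sigma_i + np_i + \tfrac{1}{2}\right)\left[\frac{Z_i\sigma_i}{np_i} - \frac{1}{2}\left(\frac{Z_i\sigma_i}{np_i}\right)^2 + \frac{1}{3}\left(\frac{Z_i\sigma_i}{np_i}\right)^3 - \cdots\right].$$

Replacing $\sigma_i$ by its value $\sqrt{np_i(1 - p_i)}$ and arranging the terms in descending order of n, we obtain

$$\ln M = -\sum_{i=1}^{k} Z_i\sqrt{p_i(1 - p_i)}\;n^{1/2} + \sum_{i=1}^{k} \left(\tfrac{1}{2}q_i Z_i^2 - q_i Z_i^2\right) + \text{terms of lower order},$$

where $q_i = 1 - p_i$.

$$\text{Since}\quad \sum_{i=1}^{k} Z_i\sqrt{p_i(1 - p_i)}\;n^{1/2} = \sum_{i=1}^{k} (n_i - np_i) = \sum_{i=1}^{k} n_i - n\sum_{i=1}^{k} p_i = n - n = 0,$$

$$\text{and}\quad \sum_{i=1}^{k} \left(\tfrac{1}{2}q_i Z_i^2 - q_i Z_i^2\right) = -\frac{1}{2}\sum_{i=1}^{k} q_i Z_i^2 = -\frac{1}{2}\sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i},$$

$$M = e^{-\frac{1}{2}\sum (n_i - np_i)^2/np_i} = e^{-\chi^2/2},$$

$$\text{where}\quad \chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i}.$$

Hence, we get the result as

$$f(n_1, n_2, \ldots, n_k) = B\,e^{-\chi^2/2},$$

where $B = (2\pi n)^{(1-k)/2}\,(p_1 p_2 \cdots p_k)^{-1/2}$.

That is, $f(n_1, \ldots, n_k)$ varies as that of the sum of squares of k normal variables subject to $\sum_{i=1}^{k} Z_i\sqrt{p_i(1 - p_i)\,n} = 0$, but otherwise independent. Thus it is a good approximation if all the expected values $np_i$ are sufficiently large, and hence the $\chi^2$ as defined above conforms to the $\chi^2$-distribution with (k - 1) degrees of freedom.

We therefore conclude that, if a sample of n observations is distributed over k classes such that the observed frequency in the ith class is $n_i$ and the expected frequency in that class is $np_i$, the statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i} = \sum_{i=1}^{k} \frac{(\text{Observed } n_i - \text{Expected } n_i)^2}{\text{Expected } n_i},$$

called the $\chi^2$-statistic, has a $\chi^2$-distribution with (k - 1) degrees of freedom.

It is clear that $\chi^2$ will be small when all $n_i$ are close to their expected values $np_i$. The $\chi^2$ will become larger when the difference becomes larger. The $\chi^2$ thus measures the amount of deviation (or agreement) between observed and expected results.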
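
A minimal sketch of the $\chi^2$-statistic follows. It is an illustration only: the function name is assumed, the counts (490 heads and 410 tails in 900 tosses of a coin, tested against equal probabilities) are used purely as example data, and the critical value 3.84 is the Table 17.1 entry for one degree of freedom.

```python
def pearson_chi_square(observed, probabilities):
    """Pearson's statistic: sum of (observed - expected)^2 / expected.

    observed      : observed class frequencies n_i
    probabilities : hypothesized class probabilities p_i (summing to 1)
    Returns (statistic, degrees of freedom k - 1).
    """
    n = sum(observed)
    expected = [n * p for p in probabilities]             # n * p_i
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Illustrative data: 900 tosses of a coin, 490 heads and 410 tails.
stat, df = pearson_chi_square([490, 410], [0.5, 0.5])
critical = 3.84                      # chi2_{0.05,(1)} from Table 17.1
print(round(stat, 2), df, stat > critical)
```

Here each expected frequency is 450, well above the small-sample range, so the chi-square approximation is reasonable for these counts.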


INTRODUCTION TO STATISTI 
188 CAL THEORy 
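Assuming that $n$ and all $n_i$ $(i = 1, 2, \ldots, k)$ are sufficiently large, we apply Stirling's approximation to the factorials and obtain

\[
f(n_1, n_2, \ldots, n_k) \approx \frac{\sqrt{2\pi}\, n^{n+1/2} e^{-n}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}}{\prod_{i=1}^{k} \sqrt{2\pi}\, n_i^{n_i+1/2} e^{-n_i}}
= \frac{1}{(2\pi n)^{(k-1)/2} (p_1 p_2 \cdots p_k)^{1/2}} \prod_{i=1}^{k} \left(\frac{np_i}{n_i}\right)^{n_i+1/2}.
\]

Now $E(n_i) = np_i$, meaning that expected cell frequency = $n \times$ (cell probability), and $\mathrm{Var}(n_i) = np_i(1 - p_i)$ for $i = 1, 2, \ldots, k$.

The standard normal variable corresponding to $n_i$ is

\[
Z_i = \frac{n_i - np_i}{\sqrt{np_i(1 - p_i)}} = \frac{n_i - np_i}{\sigma_i},
\]

so that $np_i = n_i - Z_i\sigma_i$ and $n_i = Z_i\sigma_i + np_i$.

Substituting these transformations in $f(n_1, n_2, \ldots, n_k)$, i.e., replacing $n_i$ by its equivalent, we get

\[
f(n_1, n_2, \ldots, n_k) = \frac{\prod_{i=1}^{k} \left(1 + \dfrac{Z_i\sigma_i}{np_i}\right)^{-(Z_i\sigma_i + np_i + 1/2)}}{(2\pi n)^{(k-1)/2} (p_1 p_2 \cdots p_k)^{1/2}} = \frac{M}{C}, \text{ say.}
\]

Taking natural logarithm (ln) of $M$, we get

\[
\ln M = -\sum_{i=1}^{k} \left(Z_i\sigma_i + np_i + \tfrac{1}{2}\right) \ln\left(1 + \frac{Z_i\sigma_i}{np_i}\right)
= -\sum_{i=1}^{k} \left(Z_i\sigma_i + np_i + \tfrac{1}{2}\right) \left[\frac{Z_i\sigma_i}{np_i} - \frac{1}{2}\left(\frac{Z_i\sigma_i}{np_i}\right)^2 + \frac{1}{3}\left(\frac{Z_i\sigma_i}{np_i}\right)^3 - \cdots\right].
\]

Replacing $\sigma_i$ by its value $\sqrt{np_i(1 - p_i)}$ and arranging the terms in descending order of $n$, we get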



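\[
\ln M = -\sum_{i=1}^{k} Z_i \sqrt{p_i(1 - p_i)}\; n^{1/2} + \sum_{i=1}^{k} \left(\tfrac{1}{2} Z_i^2 (1 - p_i) - Z_i^2 (1 - p_i)\right) + \text{terms of lower order.}
\]

Since

\[
\sum_{i=1}^{k} Z_i \sqrt{p_i(1 - p_i)}\; n^{1/2} = \sum_{i=1}^{k} (n_i - np_i) = \sum_{i=1}^{k} n_i - n\sum_{i=1}^{k} p_i = n - n = 0,
\]

and

\[
\sum_{i=1}^{k} \left(\tfrac{1}{2} Z_i^2 (1 - p_i) - Z_i^2 (1 - p_i)\right) = -\frac{1}{2} \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i},
\]

we have, neglecting the terms of lower order,

\[
M = e^{-\frac{1}{2}\sum (n_i - np_i)^2 / np_i} = e^{-\chi^2/2},
\qquad \text{where } \chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i}.
\]

Hence, we get the result as

\[
f(n_1, n_2, \ldots, n_k) = B\, e^{-\chi^2/2},
\qquad \text{where } B = (2\pi n)^{(1-k)/2} (p_1 p_2 \cdots p_k)^{-1/2}.
\]

That is, $f(n_1, \ldots, n_k)$ varies as that of the sum of squares of k normal variables subject to $\sum_{i=1}^{k} Z_i \sqrt{p_i(1 - p_i)\,n} = 0$, but otherwise independent. Thus it is a good approximation if all the expected values $np_i$ are sufficiently large and hence the $\chi^2$ as defined above, conforms to the $\chi^2$-distribution with (k-1) degrees of freedom.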
We therefore conclude that, if a sample of n observations is distributed over k classes such that the observed frequency in the ith class is $n_i$ and the expected frequency in that class is $np_i$, the statistic

\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i} = \sum_{i=1}^{k} \frac{(\text{Observed } n_i - \text{Expected } n_i)^2}{\text{Expected } n_i},
\]

called the $\chi^2$-statistic, has approximately a $\chi^2$-distribution with (k-1) degrees of freedom.

It is clear that $\chi^2$ will be small when all $n_i$ are close to their expected values $np_i$. The $\chi^2$ will become larger when the differences become larger. The $\chi^2$ thus measures the amount of deviation (or agreement) between observed and expected results.
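The quality of the approximation $f \approx B e^{-\chi^2/2}$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the cell probabilities and counts are hypothetical values chosen for illustration and do not come from the text.

```python
import math

def multinomial_pmf(counts, probs):
    """Exact multinomial probability f(n_1, ..., n_k)."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)      # stays an integer at every step
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

def chi_square(counts, probs):
    """Sum of (n_i - n p_i)^2 / (n p_i) over the k cells."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def approximate_pmf(counts, probs):
    """B * exp(-chi^2 / 2) with B = (2 pi n)^((1-k)/2) (p_1...p_k)^(-1/2)."""
    n, k = sum(counts), len(counts)
    B = (2 * math.pi * n) ** ((1 - k) / 2) / math.sqrt(math.prod(probs))
    return B * math.exp(-chi_square(counts, probs) / 2)

probs = [0.2, 0.3, 0.5]                  # hypothetical cell probabilities
results = []
for counts in ([12, 18, 30], [14, 16, 30]):   # hypothetical samples, n = 60
    exact = multinomial_pmf(counts, probs)
    approx = approximate_pmf(counts, probs)
    results.append((counts, exact, approx))
    print(counts, exact, approx)
```

With all expected frequencies well above 5, the two values should agree to within a few percent; the agreement deteriorates as the expected frequencies become small.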


The following conditions should be satisfied:

(i) The size of the sample n or the total number of observations should be at least 50, otherwise the $Z_i$ will not be normally distributed.

(ii) The observations in the sample or the frequencies in the classes or cells should be independent.

(iii) The restrictions or constraints, if any, should be linear.

(iv) The expected number, $np_i$, in any of the classes or cells should not be less than 5. So when the expected frequency, $np_i$, in any class is less than 5, we may combine this class with one or more other classes to meet this requirement.


Testing Hypothesis about p's of the Multinomial Distribution. In multinomial type problems involving k cells and where the cell probabilities $p_i$ are specified by the null hypothesis, the procedure for testing the hypothesis $H_0: p_i = p_{i0}$, $i = 1, 2, \ldots, k$, is given below:

(i) Formulate the null and alternative hypotheses about p's as

$H_0: p_1 = p_{10},\; p_2 = p_{20},\; \ldots,\; p_k = p_{k0}$, and
$H_1: p_i \neq p_{i0}$ for at least one value of $i = 1, 2, \ldots, k$,

where $p_{10}, p_{20}, \ldots, p_{k0}$ are specified values.

(ii) Decide on significance level $\alpha$.

(iii) The test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}}
\]

which, if $H_0$ is true, has an approximate chi-square distribution with (k-1) degrees of freedom.

(iv) Determine the critical region, which depends upon $\alpha$ and the degrees of freedom.

(v) Compute the expected frequencies $np_{i0}$ and the value of $\chi^2$.

(vi) Decide as below:

Reject $H_0$, if the computed value of $\chi^2$ falls in the critical region.

Accept $H_0$, otherwise.


Example 17.7. In a genetic experiment involving the crossing of two types of peas, Mendel observed 315 round and yellow, 108 round and green, 101 angular and yellow, and 32 angular and green seeds from the plants but his theory of heredity called for a 9:3:3:1 ratio. Using a 5 percent and a 1 percent significance level, do the data support the theory? (P.U., B.A./B.Sc. 1972, 82, 91)

(i) We state our null and alternative hypotheses as

$H_0: p_1 = \frac{9}{16},\; p_2 = \frac{3}{16},\; p_3 = \frac{3}{16},\; p_4 = \frac{1}{16}$ for a multinomial distribution involving four cells and with n = 556; and
$H_1: p_i \neq p_{i0}$ for at least one value of i = 1, 2, 3, 4.

(ii) The significance levels are set at $\alpha = 0.05$ and $\alpha = 0.01$.

(iii) The test-statistic under $H_0$ is

\[
\chi^2 = \sum_{i=1}^{4} \frac{(n_i - np_{i0})^2}{np_{i0}}
\]

which has an approximate chi-square distribution with 3 d.f.


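(iv) Computations. Under $H_0$, the expected frequencies are

$np_{10}$ = round and yellow seeds = $556 \times \frac{9}{16} = 312.75$;
$np_{20}$ = round and green seeds = $556 \times \frac{3}{16} = 104.25$;
$np_{30}$ = angular and yellow seeds = $556 \times \frac{3}{16} = 104.25$;
$np_{40}$ = angular and green seeds = $556 \times \frac{1}{16} = 34.75$.

The value of $\chi^2$ is then computed as below:

Class                 n_i     np_i0    n_i - np_i0   (n_i - np_i0)^2   (n_i - np_i0)^2 / np_i0
Round and yellow      315    312.75       2.25           5.0625              0.0162
Round and green       108    104.25       3.75          14.0625              0.1349
Angular and yellow    101    104.25      -3.25          10.5625              0.1013
Angular and green      32     34.75      -2.75           7.5625              0.2176
Total                 556    556.00                                          0.4700

Thus $\chi^2 = 0.47$.

(v) The critical regions are $\chi^2 \geq \chi^2_{0.05,(3)} = 7.82$, and $\chi^2 \geq \chi^2_{0.01,(3)} = 11.34$.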
(vi) Conclusion. Since the computed value of $\chi^2$ does not fall in the critical region at either significance level, we do not reject our null hypothesis. There is no evidence against the theory; the data are consistent with the hypothesized 9:3:3:1 ratio.
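The arithmetic of Example 17.7 can be reproduced in a few lines. This is a minimal sketch, not part of the original solution; the variable names are illustrative, and the critical values are simply the tabulated figures quoted in step (v).

```python
# Observed seed counts and the hypothesized 9:3:3:1 ratio (Example 17.7).
observed = [315, 108, 101, 32]
ratio = [9, 3, 3, 1]

n = sum(observed)
expected = [n * r / sum(ratio) for r in ratio]

# Pearson's statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Tabulated critical values for 3 d.f., as quoted in the text.
critical = {0.05: 7.82, 0.01: 11.34}
reject = {alpha: chi2 >= value for alpha, value in critical.items()}

print(expected)
print(round(chi2, 2))
print(reject)
```

The statistic is far below both critical values, which agrees with the conclusion that the null hypothesis is not rejected at either level.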


Example 17.8. Two hundred digits were chosen at random from a set of tables. The frequencies of the digits were:

Digit     :  0   1   2   3   4   5   6   7   8   9
Frequency : 18  19  23  21  16  25  22  20  21  15

Use the $\chi^2$-test to assess the correctness of the hypothesis that the digits were distributed in equal numbers in the tables from which these were chosen.


(i) We formulate our hypotheses as

$H_0: p_1 = p_2 = \cdots = p_{10} = \frac{1}{10}$ for a multinomial distribution involving 10 classes and with n = 200, and

$H_1: p_i \neq p_{i0}$ for at least one value of i = 1, 2, ..., 10 (or the digits were not distributed in equal numbers).

(ii) We use a significance level of $\alpha = 0.05$.

(iii) The test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{10} \frac{(n_i - np_{i0})^2}{np_{i0}}.
\]

When $H_0$ is true, the statistic has an approximate chi-square distribution with 10 - 1 = 9 degrees of freedom.

(iv) The rejection region is $\chi^2 \geq \chi^2_{0.05,(9)} = 16.92$.

(v) Computations. Under $H_0$, the expected frequency of each of the digits 0, 1, 2, ..., 9 is $np_{i0} = 200\left(\frac{1}{10}\right) = 20$. Therefore

\[
\chi^2 = \sum_{i=1}^{10} \frac{(n_i - np_{i0})^2}{np_{i0}} = \frac{(18 - 20)^2}{20} + \frac{(19 - 20)^2}{20} + \cdots + \frac{(15 - 20)^2}{20} = \frac{86}{20} = 4.3.
\]

(vi) Conclusion. Since the calculated value of $\chi^2 = 4.3$ does not fall in the rejection region, we are therefore unable to reject our null hypothesis. Hence we may conclude that the digits appear to be distributed in equal numbers in the tables from which they were chosen.

Example 17.9. A coin is thrown 1600 times, and 840 heads are recorded. Test the hypothesis that the coin is unbiased. Use both the chi-square approximation and the normal approximation and compare the results. Use $\alpha = 0.05$.

(a) Using Chi-square Approximation (to multinomial)

(i) We state our null and alternative hypotheses in multinomial notation, as

$H_0: p_1 = \frac{1}{2},\; p_2 = \frac{1}{2}$ for a multinomial distribution involving 2 classes and with n = 1600; and

$H_1$: both p's are not equal to 1/2.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic under $H_0$ is

\[
\chi^2 = \frac{(n_1 - np_{10})^2}{np_{10}} + \frac{(n_2 - np_{20})^2}{np_{20}}
\]

where $n_1$ and $n_2$ are the number of heads (840) and number of tails (760) in this case. The test-statistic has an approximate chi-square distribution with 1 d.f.

(iv) Computations. Under $H_0$, the expected frequencies are

$np_{10} = 1600 \times \frac{1}{2} = 800$, $np_{20} = 1600 \times \frac{1}{2} = 800$.

\[
\chi^2 = \frac{(840 - 800)^2}{800} + \frac{(760 - 800)^2}{800} = 2 + 2 = 4.
\]

(v) The critical region is $\chi^2 \geq \chi^2_{0.05,(1)} = 3.84$.

(vi) Conclusion. Since the computed value of $\chi^2 = 4$ falls in the critical region, we therefore reject our null hypothesis. The data provide evidence to conclude that the coin is biased.


(b) Using Normal Approximation (to binomial)

(i) We set up our null and alternative hypotheses in binomial notation, as

$H_0: p = \frac{1}{2}$ and $H_1: p \neq \frac{1}{2}$.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic under $H_0$ is

\[
z = \frac{x - np_0}{\sqrt{np_0 q_0}} \quad \text{(without continuity correction)}
\]

where x is the number of heads and z is approximately standard normal.

(iv) Computations. We compute the value of z as

\[
z = \frac{840 - 800}{\sqrt{1600 \times \frac{1}{2} \times \frac{1}{2}}} = \frac{40}{20} = 2.
\]

(v) The critical region is $|z| \geq 1.96$.
(vi) Conclusion. Since the computed value z = 2 falls in the critical region, we reject $H_0$ and may conclude that the coin is a biased one.

The two solutions give identical results. When k = 2, the chi-square approximation gives a test that is equivalent to the normal approximation, the value of $\chi^2$ being the square of the normal statistic. Hence with two categories, either procedure can be used.

2. Pearson's Test For Goodness-of-Fit. The $\chi^2$-test can also be applied when the cell probabilities are not specified but depend upon the unknown parameters of a probability distribution such as the binomial distribution, the Poisson distribution, the normal distribution, etc. This test is based on the property that the statistic still has an approximate chi-square distribution when the cell probabilities depend upon unknown parameters, provided that the unknown parameters are replaced with their estimates and provided that one degree of freedom is deducted for each parameter estimated. When there are k classes/categories and the class probabilities p's are known, the number of degrees of freedom is k-1. When the p's depend upon m parameters, the degrees of freedom would be k-1-m, i.e.

d.f. = number of classes - 1 - number of parameters estimated from the sample.

For example, in a normal distribution, the cell probabilities depend upon the two parameters $\mu$ and $\sigma$, therefore the degree of freedom is (k-1-2), i.e. (k-3).


A goodness-of-fit test is a hypothesis test that is concerned with the determination whether results of a sample conform to a hypothesized distribution which may be the uniform, binomial, Poisson, normal or any other distribution. This is a kind of hypothesis test for problems where we do not know the probability distribution of the random variable under consideration, say X, and we wish to test the hypothesis that X follows a particular distribution. In the test procedure, the range of all possible values of the random variable assumed to follow a particular distribution is divided into k mutually exclusive classes and the probabilities $p_i$'s are calculated for each of the classes, using the estimates of the parameters of the probability distribution specified in $H_0$. The $np_i$ represents the expected number of observations that fall in the ith class and $n_i$ represents the observed number of observations in that class. The differences between observed and expected number of observations can arise from sampling error or from $H_0$ being false. Small differences are generally attributed to sampling error; large differences, which are considered to arise from $H_0$ being false, are unlikely if the hypothesized distribution gives a satisfactory fit to the sample data ($H_0$ true). To see whether there is evidence of small or large differences, the test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i} = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}
\]

which, when $H_0$ is true, has an approximate chi-square distribution with d.f. = k - 1 - number of parameters estimated by sample statistics. The symbols $o_i$ and $e_i$ represent the observed and expected values of $n_i$ respectively.

When the observed values are equal to the expected values, $\chi^2 = 0$. The larger the differences between observed and expected values, the larger will be the value of $\chi^2$. A small computed value of $\chi^2$ indicates a good fit and it leads to the acceptance of the null hypothesis. A large computed value of $\chi^2$ indicates a poor fit and it leads to the rejection of the null hypothesis. Hence the rejection region in a goodness-of-fit test (and all tests that compare frequencies) will fall in the right tail of the chi-square distribution.
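Because the rejection region lies in the right tail, the critical values quoted from the $\chi^2$-table can be checked by computing the right-tail probability directly. The sketch below is one way to do this with the standard library only; the function name is hypothetical, and the recurrence used (adding one term per unit increase in half the degrees of freedom, starting from the closed forms for 1 and 2 d.f.) is a standard property of the incomplete gamma function rather than anything given in the text.

```python
import math

def chi_square_tail(x, df):
    """P(chi-square with integer df >= 1 degrees of freedom exceeds x)."""
    z = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)               # closed form for 2 d.f.
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))    # closed form for 1 d.f.
    # Each step raises the degrees of freedom by 2 (a by 1).
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail

# Critical values at the 5 percent level quoted in this chapter.
for x, df in [(3.84, 1), (5.99, 2), (7.82, 3), (9.49, 4), (12.59, 6), (16.92, 9)]:
    print(df, x, round(chi_square_tail(x, df), 4))
```

Each printed probability should be close to 0.05, the significance level to which the tabulated value corresponds.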
The procedure for a goodness-of-fit test is as follows:

(i) Formulate the null and alternative hypotheses as

$H_0$: The population has a specified probability distribution, and

$H_1$: The population does not have the specified distribution.

(ii) Choose the level of significance $\alpha$. The commonly used value is $\alpha = 0.05$.

(iii) The test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}
\]

which, if $H_0$ is true, has an approximate chi-square distribution with d.f. = k-1-number of estimated parameters.

(iv) Determine the critical region, which depends upon $\alpha$ and the degrees of freedom.

(v) Compute the expected values and the value of $\chi^2$.

(vi) Decide as below:

Reject $H_0$, if the calculated value of $\chi^2$ exceeds the $\chi^2$ value against the appropriate degrees of freedom from the $\chi^2$-table.

Accept $H_0$, otherwise.


Example 17.10. Five pennies were tossed 1,000 times and the number of heads were observed as given below:

Number of heads :  0, 1, 2, 3, 4, 5
Frequencies     :  ..., 287, 164, ...  (the remaining entries are not legible)

Test whether a binomial distribution gives a satisfactory fit to these data. (P.U., B.A./B.Sc. 1980)

(i) We state our hypotheses as

$H_0$: The population distribution is a binomial with n = 5, but the parameter p unspecified, and

$H_1$: The population distribution is not a binomial with n = 5.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) We use the test-statistic

\[
\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}
\]

which, if $H_0$ is true, has an approximate $\chi^2$-distribution with degrees of freedom = k-1-number of estimated parameters. A binomial distribution has two parameters n and p. Here n is specified (n = 5). We have to estimate the value of one parameter p from the sample data. Therefore the degrees of freedom = 6-1-1 = 4 (since k = 6 categories).

(iv) The critical region is $\chi^2 \geq \chi^2_{0.05,(4)} = 9.49$.


(v) Computations. To estimate the value of p, we first compute the mean number of heads, $\bar{x}$. Thus

\[
\bar{x} = \frac{\sum fx}{\sum f} = \frac{\sum fx}{1000} = 2.47.
\]

Theoretically, $\bar{x} = np$, so that $\hat{p} = \dfrac{\bar{x}}{n} = \dfrac{2.47}{5} = 0.494$.

Hence the expected (fitted) frequencies are the terms in the binomial expansion of $1{,}000\,(0.506 + 0.494)^5$, which are given in the column headed $e_i$. Next, we calculate the value of $\chi^2$ as follows:

[Table: number of heads, observed f $(o_i)$, expected f $(e_i)$ and $(o_i - e_i)^2/e_i$; the entries are not legible.]


(vi) Conclusion. Since the calculated value of $\chi^2 = 8.18$ does not fall in the critical region, we therefore are unable to reject our null hypothesis. We may accept the hypothesis that the distribution of the number of heads is a binomial distribution and conclude that the fit of data is good.

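The fitted frequencies of Example 17.10 depend only on the estimate $\hat{p} = 0.494$, the number of pennies and the number of tosses, so they can be recomputed even without the observed table. A minimal sketch, with illustrative variable names:

```python
import math

n_coins = 5        # pennies per toss
n_trials = 1000    # number of tosses
p_hat = 0.494      # estimate of p obtained from the sample mean

# Terms of 1000 * (0.506 + 0.494)^5: expected frequency of x heads.
expected = [
    n_trials * math.comb(n_coins, x) * p_hat ** x * (1 - p_hat) ** (n_coins - x)
    for x in range(n_coins + 1)
]

# One parameter (p) was estimated, so d.f. = k - 1 - 1.
df = len(expected) - 1 - 1

for x, e in enumerate(expected):
    print(x, round(e, 1))
print("d.f. =", df)
```

All six expected frequencies exceed 5, so no classes need to be combined and the degrees of freedom remain 4, as stated in the solution.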
Example 17.11. A skilled typist, on routine work, kept a record of mistakes made per day during 300 working days.

[Table: mistakes per day (x) and number of days (f); the entries are not legible.]

Test the hypothesis that X has a Poisson distribution by applying the $\chi^2$ goodness-of-fit test.


(i) We state our null and alternative hypotheses as

$H_0$: The population has a Poisson distribution with mean $\mu$ unspecified, and

$H_1$: The population does not have a Poisson distribution.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) The test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}
\]

which, if $H_0$ is true, has an approximate $\chi^2$-distribution with degrees of freedom = k-1-number of estimated parameters.

(iv) Computations. A Poisson distribution has one parameter $\mu$. To estimate the value of $\mu$, we compute the mean number of mistakes per day. Therefore

\[
\bar{x} = \frac{\sum fx}{\sum f} = 0.89.
\]

Thus the fitted Poisson distribution is

\[
p(x;\, 0.89) = \frac{e^{-0.89}\,(0.89)^x}{x!}, \quad \text{for } x = 0, 1, 2, \ldots
\]

We calculate the expected frequencies (given $e^{-0.89} = 0.4107$) which appear in the column headed $e_i$ and the value of $\chi^2$ as follows:

[Table: mistakes per day (x), observed f $(o_i)$, expected f $(e_i)$ and $(o_i - e_i)^2/e_i$; the entries are not legible.]

The classes with small expected frequencies are combined so that the new final class corresponds to $x \geq 3$ with expected frequency = 18.4; the corresponding observed frequencies are combined also. The test-statistic is then

\[
\chi^2 = \sum_{i=1}^{4} \frac{(o_i - e_i)^2}{e_i}.
\]

The number of degrees of freedom is k-1-number of estimated parameters. Here k = 4, the number of expected frequencies used in computing $\chi^2$, and we have to estimate the value of one parameter, $\mu$, from the sample data. Thus the degrees of freedom = 4-1-1 = 2.


(v) The critical region is $\chi^2 \geq \chi^2_{0.05,(2)} = 5.99$.

(vi) Conclusion. Since the calculated value of $\chi^2 = 10.01$ falls in the critical region, we reject our null hypothesis. We may conclude that the Poisson distribution is not a good fit to the data.
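The expected frequencies of the fitted Poisson distribution in Example 17.11, including the pooled final class $x \geq 3$, follow from $\hat{\mu} = 0.89$ and the 300 working days alone. The sketch below recomputes them; the variable names are illustrative, and small differences from the figures in the text (such as 18.4) are to be expected from rounding of $e^{-0.89}$.

```python
import math

mu_hat = 0.89     # estimated mean number of mistakes per day
n_days = 300      # number of working days observed

# Poisson probabilities for x = 0, 1, 2.
probs = [math.exp(-mu_hat) * mu_hat ** x / math.factorial(x) for x in range(3)]

# Pool everything from x = 3 upwards into one final class.
probs.append(1.0 - sum(probs))

expected = [n_days * p for p in probs]

# k = 4 classes and one estimated parameter give d.f. = 4 - 1 - 1.
df = len(expected) - 1 - 1

labels = ["0", "1", "2", "3 or more"]
for label, e in zip(labels, expected):
    print(label, round(e, 1))
print("d.f. =", df)
```

Pooling at $x \geq 3$ keeps every expected frequency above 5, which is the reason the final classes are combined before $\chi^2$ is computed.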

Example 17.12. Test the hypothesis that the following frequency distribution follows a normal distribution at $\alpha = 0.05$.

Intervals : 10-  12-  14-  16-  18-  20-  22-  24-  26-
Frequency :   4   30  106  206  272  219  120   37    6


(i) We formulate our null and alternative hypotheses as

$H_0$: The distribution is normal with mean $\mu$ and variance $\sigma^2$, and
$H_1$: The distribution is not normal.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}
\]

which, if $H_0$ is true, has an approximate $\chi^2$-distribution with (k-1-m) degrees of freedom and where k denotes the number of intervals (after pooling, if necessary) and m represents the number of parameters estimated from the sample data.

(iv) Computations. We first need to fit a normal distribution, but neither the mean $\mu$, nor the standard deviation $\sigma$, is given. We therefore estimate $\mu$ by the sample mean $\bar{x}$ and $\sigma$ by the sample standard deviation s. Using the data, we find $\sum fx = 19140$ and $\sum fx^2 = 374688$, so that $\bar{x} = 19.14$ and s = 2.89.

Next we need to compute the expected frequencies for all classes and the value of $\chi^2$. The necessary calculations for expected frequencies, $e_i$'s ($e_i = n\hat{p}_i$, where $\hat{p}_i$ is the estimate of $p_i$) with the value of $\chi^2$ are shown below:

[Table: upper class boundary, observed frequency, expected frequency $e_i$ $(= n\hat{p}_i)$ and the contributions to $\chi^2$; most entries are not legible.]

There are 9 class-intervals (no classes have been combined) and we have used the sample mean, $\bar{x}$, and the sample standard deviation, s, to estimate the two parameters $\mu$ and $\sigma$, so the number of degrees of freedom is 9-1-2 = 6.

(v) The critical region is $\chi^2 \geq \chi^2_{0.05,(6)} = 12.59$.

(vi) Conclusion. Since the calculated value of $\chi^2 = 2.82$ does not fall in the critical region, we are unable to reject our null hypothesis and may conclude that the normal distribution provides a good fit for the given frequency distribution.
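The fitting steps of Example 17.12 can be sketched as follows. Two points are assumptions made for this illustration: class midpoints are taken as the lower limit plus 1 (which reproduces $\sum fx = 19140$), and the first and last classes are treated as open-ended so that the expected frequencies add up to n. Because the normal probabilities are computed exactly here rather than read from a table, the resulting $\chi^2$ need not match the 2.82 of the text to the second decimal.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution, via math.erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

lower = [10, 12, 14, 16, 18, 20, 22, 24, 26]       # lower class limits
observed = [4, 30, 106, 206, 272, 219, 120, 37, 6]
n = sum(observed)

# Sample mean and standard deviation from class midpoints (assumed width 2).
mids = [lo + 1 for lo in lower]
mean = sum(f * x for f, x in zip(observed, mids)) / n
sd = math.sqrt(sum(f * x * x for f, x in zip(observed, mids)) / n - mean ** 2)

# Class boundaries, with open-ended first and last classes (assumption).
cuts = [-math.inf] + lower[1:] + [math.inf]
cdf = [normal_cdf(c, mean, sd) for c in cuts]
expected = [n * (cdf[i + 1] - cdf[i]) for i in range(len(observed))]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2      # two parameters estimated

print(round(mean, 2), round(sd, 2))
print([round(e, 1) for e in expected])
print(round(chi2, 2), "with", df, "d.f.")
```

The computed statistic stays well below the critical value 12.59, in line with the conclusion that the normal distribution fits the data.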
0.3. Testing Hypothesis about Independence of Two Variables. The $\chi^2$-test can also be used to test the hypothesis about the independence of two variables, each of which is classified into a number of categories or attributes. Before we discuss the format of the test of independence, we give a brief description of the theory of attributes.


17.6 AN ASIDE - ATTRIBUTES

We have been dealing with quantitative data obtained by measuring the actual magnitude of some variable character of the individuals or objects. Data may also be obtained if we simply note the presence of some qualitative characteristic and count how many do or do not possess it. The qualitatively distinct characteristics such as male or female, tall or short, satisfied or dissatisfied, high or low, healthy or diseased, positive or negative, etc. are called attributes. The attributes cannot be measured accurately but they can be divided into classes and their numbers in each class can be counted. If the data (i.e. population) are divided into two distinct and mutually exclusive classes by a single attribute as for instance, the population of human beings is divided into males and females, the process is called dichotomy (cutting in two).


The capital letters A, B, C, ... are used to denote the several attributes. The individuals or objects possessing the attributes A, B, C, ... are conventionally designated A, B, C, ..., while the absence of these attributes by the Greek letters $\alpha$, $\beta$, $\gamma$, .... Thus if A denotes that the individual or object possesses the attribute A, then $\alpha$ will denote that the individual does not possess the attribute A. Hence "$\alpha$" is equivalent to "not-A". The attributes denoted by the letters A, B, C, ... are called the positive attributes and those denoted by $\alpha$, $\beta$, $\gamma$, ... the negative attributes. The letters are enclosed in brackets to denote the class-frequencies of the corresponding attributes such as (A), (B), etc., and they are combined to denote two or more attributes simultaneously. Thus, if A represents "intelligence" and B, "smoking", AB will represent "intelligence and smoking" and (AB) the number of individuals who are intelligent and smokers. Similarly $(A\beta)$ will represent the number of individuals who are "intelligent but non-smokers" and so on. The order of the class-frequencies depends on the number of attributes combined together except N, the population size which is regarded as a frequency of order zero.


It is interesting to note that any class-frequency can always be expressed in terms of class-frequencies of higher order. For example,

\[
(A) = (AB) + (A\beta)
\]

i.e., the number of A's is equal to the number of A's which are B's plus the number of A's which are $\beta$'s. The frequencies of the highest order are termed as the ultimate class-frequencies. We can also express every class-frequency as the sum of certain of the ultimate class-frequencies. As for instance, we have

\[
(A) = (AB) + (A\beta) = (ABC) + (AB\gamma) + (A\beta C) + (A\beta\gamma).
\]

The frequencies on the right hand side are some of the ultimate class-frequencies, if we consider three attributes A, B, and C.


In order to express any class-frequency in terms of the known class-frequencies, we can treat the class symbols as operators and can multiply them like algebraic quantities. Writing A.N for the operation of dichotomising N according to attribute A, we can write

\[
A.N = (A),
\]

i.e., by dichotomising N according to A, we get (A), the class frequency of A.

Similarly, $\alpha.N = (\alpha)$.

Adding, we get $A.N + \alpha.N = (A) + (\alpha)$, or $(A + \alpha).N = N$.

Eliminating N, we get a symbolic relation as $A + \alpha = 1$. Hence we can replace A by $1 - \alpha$ and $\alpha$ by $1 - A$. It has been noted that these operative symbols obey the ordinary laws of algebra. For example, we have

\[
(\alpha B) = \alpha B.N = (1 - A)B.N = (B - AB).N = (B) - (AB).
\]

Similarly,

\[
(\alpha\beta\gamma) = \alpha\beta\gamma.N = (1 - A)(1 - B)(1 - C).N
= (1 - A - B - C + AB + BC + AC - ABC).N
= N - (A) - (B) - (C) + (AB) + (BC) + (AC) - (ABC).
\]


Example 17.13. Given that $(A) = (\alpha) = (B) = (\beta) = (C) = (\gamma) = \dfrac{N}{2}$ and also that $(ABC) = (\alpha\beta\gamma)$, show that

\[
2(ABC) = (AB) + (BC) + (CA) - \frac{N}{2}.
\]

We are given that $(ABC) = (\alpha\beta\gamma)$.

But

\[
(\alpha\beta\gamma) = \alpha\beta\gamma.N = (1 - A)(1 - B)(1 - C).N
= (1 - A - B - C + AB + BC + CA - ABC).N
\]
\[
= N - (A) - (B) - (C) + (AB) + (BC) + (CA) - (ABC)
= N - \frac{3N}{2} + (AB) + (BC) + (CA) - (ABC)
= (AB) + (BC) + (CA) - \frac{N}{2} - (ABC).
\]

Substituting this value, we get

\[
2(ABC) = (AB) + (BC) + (CA) - \frac{N}{2}.
\]


Example 17.14. 100 children took three examinations. 40 passed the first, 39 passed the second and 48 passed the third. 10 passed all three, 21 failed all three, 9 passed the first two and failed the third, 19 failed the first two and passed the third. Find how many children passed at least two examinations?

Let A, B and C stand for passing the first, second and third examinations respectively. Then we are given

N = 100; (A) = 40; (B) = 39; (C) = 48; (ABC) = 10;
$(\alpha\beta\gamma) = 21$; $(AB\gamma) = 9$; $(\alpha\beta C) = 19$.

We are required to find the number of students who passed at least two (i.e., two or more) examinations; that is we need the value of

\[
(AB\gamma) + (A\beta C) + (\alpha BC) + (ABC).
\]

Now $(C) = (ABC) + (\alpha BC) + (A\beta C) + (\alpha\beta C)$, so that

\[
(ABC) + (\alpha BC) + (A\beta C) = (C) - (\alpha\beta C).
\]

Adding $(AB\gamma)$ to both sides, we get

\[
(ABC) + (\alpha BC) + (A\beta C) + (AB\gamma) = (C) - (\alpha\beta C) + (AB\gamma) = 48 - 19 + 9 = 38.
\]

17.6.1. Consistence. Any class-frequencies observed within one 
and the same population are said to be consistent with one another, 
because they conform with one another and do not conflict in any way. 
The necessary and sufficient condition for the consistence is that no 
ultimate class-frequency should be negative for the obvious reason that 
they are obtained by counting real attributes. Hence to test the 
consistence, we calculate the ultimate class-frequencies from the given 
data. If any ultimate class-frequency turns out to be negative, the given 


set of class frequencies will be inconsistent. The data are consistent, when all the ultimate class-frequencies are non-negative.
Example 17.15. A market investigator returns the following data. Of 1,000 people consulted, 811 liked chocolates, 752 liked toffee and 418 liked boiled sweets; 570 liked chocolates and toffee, 356 liked chocolates and boiled sweets and 348 liked toffee and boiled sweets; 297 liked all three. Show that this information as it stands must be incorrect.

Let A, B and C denote the liking of chocolates, toffee and boiled sweets respectively. Then we are given

N = 1,000; (A) = 811; (B) = 752; (C) = 418;
(AB) = 570; (AC) = 356; (BC) = 348; (ABC) = 297.

The test of consistence is that no ultimate class-frequency should be negative. Now

\[
(\alpha\beta\gamma) = \alpha\beta\gamma.N = (1 - A)(1 - B)(1 - C).N
= N - (A) - (B) - (C) + (AB) + (AC) + (BC) - (ABC)
\]
\[
= 1000 - 811 - 752 - 418 + 570 + 356 + 348 - 297 = -4, \text{ which is negative.}
\]

Hence the information, as it stands, is not correct. However, if the data returned are the result of an actual enquiry in a definite population, there must have been some misprint or miscount or mis-reporting.
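The consistence test can be carried out mechanically by expanding all eight ultimate class-frequencies with the operator rule $\alpha = 1 - A$. The following is a minimal sketch for three attributes; the function name and the key convention (a lowercase letter stands for the negative attribute, so "abc" is $(\alpha\beta\gamma)$) are illustrative choices, not notation from the text.

```python
def ultimate_frequencies(N, A, B, C, AB, AC, BC, ABC):
    """Eight ultimate class-frequencies from the positive class-frequencies."""
    return {
        "ABC": ABC,
        "ABc": AB - ABC,
        "AbC": AC - ABC,
        "aBC": BC - ABC,
        "Abc": A - AB - AC + ABC,
        "aBc": B - AB - BC + ABC,
        "abC": C - AC - BC + ABC,
        "abc": N - A - B - C + AB + AC + BC - ABC,
    }

# Data of Example 17.15.
freq = ultimate_frequencies(N=1000, A=811, B=752, C=418,
                            AB=570, AC=356, BC=348, ABC=297)

# The data are consistent only if no ultimate class-frequency is negative.
consistent = all(value >= 0 for value in freq.values())

print(freq)
print("consistent:", consistent)
```

Only $(\alpha\beta\gamma)$ comes out negative here, which is enough to make the returned data inconsistent.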

17.6.2. Independence. Suppose in a population of size N, the class frequencies of two attributes A and B are given by (A) and (B). Then, we have

the proportion of (A) = $\dfrac{(A)}{N}$,

the proportion of (B) = $\dfrac{(B)}{N}$,

the proportion of (A) and (B) combined = $\dfrac{(A)}{N} \cdot \dfrac{(B)}{N}$, and

the expectation of (A) and (B) combined = $\dfrac{(A)}{N} \cdot \dfrac{(B)}{N} \cdot N = \dfrac{(A)(B)}{N}$.

The two attributes A and B are said to be independent if the actual frequency equals the expected one, that is, if

\[
(AB) = \frac{(A)(B)}{N}.
\]

Similarly, $\alpha$ and $\beta$ will be independent if $(\alpha\beta) = \dfrac{(\alpha)(\beta)}{N}$, and so on. In case the ultimate class-frequencies on the two attributes A and B are given, the required criterion of independence for the two attributes A and B will be

\[
(AB)(\alpha\beta) = (A\beta)(\alpha B), \quad \text{or} \quad \frac{(AB)}{(\alpha B)} = \frac{(A\beta)}{(\alpha\beta)}.
\]

Suppose the class-frequencies may be grouped into a table of two rows and two columns as follows:

           A            alpha          Total
B         (AB)         (alpha B)       (B)
beta      (A beta)     (alpha beta)    (beta)
Total     (A)          (alpha)          N

When the two attributes are independent, the table takes the following form:

           A                alpha               Total
B         (A)(B)/N         (alpha)(B)/N         (B)
beta      (A)(beta)/N      (alpha)(beta)/N      (beta)
Total     (A)              (alpha)               N


17.6.3. Association of Attributes. The word association has a 
technical meaning in Statistics. In ordinary language, if A and B appear 
together fairly often, we speak of them as being associated. But in 
statistical usage, they are said to be associated only if they appear 
together in a larger number of cases than is to be expected if they were 
independent. Thus the mere fact that some A’s are B’s, however great 
the proportion, is not enough to conclude that A and B are associated. 
This is a fundamental principle and should always be borne in mind. 
Symbolically, A and B are said to be positively associated or simply associated, if

\[
(AB) > \frac{(A)(B)}{N}.
\]

On the contrary, A and B are said to be negatively associated or briefly, dissociated if

\[
(AB) < \frac{(A)(B)}{N}.
\]


It should be remembered that dissociation does not imply independence. 


17.6.4. Measures of Association. The strength of association between two attributes A and B is measured by a co-efficient, called the co-efficient of association and defined by the formula

\[
Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}.
\]
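The formula for Q can be evaluated directly once the four ultimate class-frequencies are known. The sketch below uses the coaching-class frequencies of Example 17.16 in this section (N = 1660, (A) = 422, (B) = 256, (AB) = 150); the variable names are illustrative, with lowercase letters standing for the negative attributes.

```python
# Positive class-frequencies from Example 17.16.
N, A, B, AB = 1660, 422, 256, 150

# Remaining ultimate class-frequencies, using alpha = 1 - A and beta = 1 - B.
Ab = A - AB              # (A beta): successful, not coached
aB = B - AB              # (alpha B): unsuccessful, coached
ab = N - A - B + AB      # (alpha beta): unsuccessful, not coached

# Frequency of (AB) expected under independence, (A)(B)/N.
expected_AB = A * B / N

# Yule's co-efficient of association.
Q = (AB * ab - Ab * aB) / (AB * ab + Ab * aB)

print(Ab, aB, ab)
print(round(expected_AB, 2))
print(round(Q, 2))
```

Since (AB) exceeds $(A)(B)/N$, the two attributes are positively associated, and Q is correspondingly positive.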


This co-efficient is due to George Udny Yule (1871-1951) and like the co-efficient of correlation, it lies between -1 and +1. Q = 0, when the attributes are independent. When there is a complete association, Q = +1, and when there is a complete dissociation, the value of Q would be -1.

Yule has proposed another co-efficient, called the co-efficient of colligation, which also measures the strength of association. It is defined by the formula

\[
Y = \frac{1 - \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}}{1 + \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}}.
\]

These two co-efficients have the same properties and are related by the equation

\[
Q = \frac{2Y}{1 + Y^2}.
\]

Example 17.16. 1,660 candidates appeared for a competitive examination and 422 were successful. 256 attended a coaching class and of these 150 came out successful. Estimate the utility of the coaching class. (P.U.)

The utility of the coaching class can be estimated by finding the association between success and coaching class.

Let us denote a successful candidate by A and a candidate attending the coaching class by B. Then we get the following data:

N = 1660, (A) = 422, (B) = 256, (AB) = 150.

We determine the other ultimate class-frequencies as below:

$(A\beta) = A\beta.N = A(1 - B).N = (A) - (AB) = 422 - 150 = 272$;

Similarly, $(\alpha B) = (B) - (AB) = 256 - 150 = 106$;

and $(\alpha\beta) = N - (A) - (B) + (AB) = 1660 - 422 - 256 + 150 = 1132$.

Yule's co-efficient of association, Q, is given by

\[
Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}
= \frac{150 \times 1132 - 272 \times 106}{150 \times 1132 + 272 \times 106} = \frac{140968}{198632} = 0.71.
\]

As the co-efficient is positive and fairly large, attending the coaching class is associated with success, i.e., the utility of coaching class is very great.

Contingency Tables. A table that consists of two or more rows and two or more columns, into which n observations are classified according to two different criteria (or variables) is commonly called a contingency table. Suppose there are two variables, say A and B, where there are r distinct classes $A_1, A_2, \ldots, A_r$ in A and c distinct classes $B_1, B_2, \ldots, B_c$ in B, and the number of observations belonging to both $A_i$ and $B_j$ is $o_{ij}$. The r x c contingency table then has the following form:

           B_1      B_2     ...    B_c      Total
A_1       o_11     o_12     ...    o_1c     (A_1)
A_2       o_21     o_22     ...    o_2c     (A_2)
...        ...      ...     ...     ...      ...
A_r       o_r1     o_r2     ...    o_rc     (A_r)
Total     (B_1)    (B_2)    ...    (B_c)      n

A contingency table may be extended to higher dimensions. The simplest form of a contingency table is the 2 x 2 table which is obtained when both criteria are dichotomised. The totals of the frequencies in each of the rows and columns are called the marginal totals or frequencies. Contingency tables provide a useful method of comparing two variables.

17.7 TESTING HYPOTHESIS OF INDEPENDENCE IN CONTINGENCY TABLES

The data presented in a contingency table can be used to test the hypothesis that the two variables of classification are independent. If this hypothesis is rejected, the two variables of classification are not independent and we say that there is some association (or interaction) between the two variables of classification. To do so, we must calculate the expected frequencies based on this hypothesis, keeping the marginal totals fixed.

Let $e_{ij}$ denote the expected frequency belonging to $A_i$ and $B_j$. Assuming the hypothesis of independence is true, the proportion of


members belonging to any class $A_i$ should be the same and equal to the proportions in the total. Thus

\[
\frac{e_{ij}}{(B_j)} = \frac{(A_i)}{n}, \quad \text{so that} \quad e_{ij} = \frac{(A_i)(B_j)}{n},
\]

that is, under $H_0$: the classifications are independent, the expected frequency in any cell is equal to the product of the marginal totals common to that cell, divided by the total number of observations.

If our hypothesis of independence is true, the differences between observed and expected frequencies are small and are attributed to sampling error. Large differences arise from $H_0$ being false. The chi-square statistic provides a means for deciding whether the differences are large or small overall. Hence the statistic to use is

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}.
\]

A large value of $\chi^2$ indicates that the null hypothesis is false. The number of degrees of freedom in a contingency table is obtained as follows:

In an r x c contingency table, there are in all r x c cells. If we look at the jth column of the contingency table, we see that when (r-1) of the values are determined, the rth value is automatically obtained as the marginal total is to remain fixed. Similarly, for the ith row, we see that when (c-1) of the values are determined, the cth value is determined from the known fixed marginal total. Thus there are (r-1) x (c-1) values that may be determined freely, whereas the remaining rc-(r-1)(c-1) values will be determined from the marginal totals. Hence the number of degrees of freedom in an r x c contingency table is (r-1)(c-1), i.e., the product of the number of rows minus one and the number of columns minus one.

The procedure for testing the null hypothesis of independence in contingency tables is given below:

(i) Formulate the null and alternative hypotheses as

$H_0$: the two variables of classification are independent, and

$H_1$: the two variables of classification are not independent, i.e. they are associated.

(ii) Choose a significance level $\alpha$. The commonly used levels are $\alpha$ = 0.05, 0.01.

(iii) The test-statistic to use is

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
\]

which, if $H_0$ is true, has an approximate chi-square distribution with (r-1)(c-1) degrees of freedom.

(iv) Compute the expected frequencies under $H_0$ for each cell by the formula

\[
e_{ij} = \frac{(A_i)(B_j)}{n} = \frac{(i\text{th Row Total})(j\text{th Column Total})}{\text{Total number of observations}}.
\]

Also calculate the value of $\chi^2$ and the degrees of freedom.

(v) Determine the critical region which depends on $\alpha$ and the number of degrees of freedom.

(vi) Decide as below:

Reject $H_0$, if the computed value of $\chi^2 \geq \chi^2_{\alpha,(r-1)(c-1)}$.

Accept $H_0$, otherwise.

Example 17.17. Four hundred and ninety-two candidates for scientific posts gave particulars of their university degrees and their hobbies. The degrees were in either maths, chemistry or physics, and the hobbies could be classified roughly as music, craftswork, reading or drama. The data are presented concisely in the following contingency

Le., the degrees and the hobbies. 2 


table: ` 


Maths. Chemistry Physics 


Music 


Craftswork 


Reading 


Drama 


Discuss the association between the two criteria of classi-ication, 
(P.U., B.A./B.Sc., 1969) 


(i) We state our null and alternative hypotheses as 
Hp: The two criteria of classification are indeperdent, and 


H,: The two criteria of classification are not independent, i.e., 
they are associated. 


(ii) We choose the significance level at $\alpha$ = 0.05.

(iii) The test-statistic to use is

$$\chi^2 = \sum_{i=1}^{4}\sum_{j=1}^{3} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},$$

which, if $H_0$ is true, has an approximate $\chi^2$-distribution with (4-1)(3-1) = 6 degrees of freedom.

(iv) Computations. We calculate the expected frequencies under the null hypothesis for each cell by the formula

$$e_{ij} = \frac{(A_i)(B_j)}{n} = \frac{(i\text{th Row Total})(j\text{th Column Total})}{\text{Total number of observations}}.$$

For instance, $e_{11} = \frac{77 \times 124}{492} = 19.4$. The expected frequencies are given in the following table:

                  Maths: B1   Chemistry: B2   Physics: B3   Total
Music: A1            19.4          73.6           31.0       124
Craftswork: A2       15.8          59.9           25.3       101
Reading: A3          29.3         111.0           46.7       187
Drama: A4            12.5          47.5           20.0        80
Total                77           292            123         492

Hence $\chi^2$ = 54.06.

(v) The critical region is $\chi^2 > \chi^2_{0.05,(6)}$ = 12.59.

(vi) Conclusion. Since the calculated value of $\chi^2$ = 54.06 falls in the critical region, we reject $H_0$ and conclude that there is association between the two criteria of classification.
Alternative Procedure. When an automatic desk calculator is available, a much shorter method for computing $\chi^2$, given by J. Skory, contains the following three steps:

(i) Compute $\sum_i \frac{o_{ij}^2}{(A_i)}$ for each column. Denote these by $T_j$.

(ii) Compute $\sum_j \frac{T_j}{(B_j)}$. Denote this by R.

(iii) Then $\chi^2 = (R - 1)n$.

Applying this method to our Example 17.17, we get

(i) $T_1 = \frac{(24)^2}{124} + \frac{(11)^2}{101} + \frac{(32)^2}{187} + \frac{(10)^2}{80} = 12.5691$,

$T_2 = \frac{(83)^2}{124} + \frac{(62)^2}{101} + \frac{(121)^2}{187} + \frac{(26)^2}{80} = 180.3600$,

$T_3 = \frac{(17)^2}{124} + \frac{(28)^2}{101} + \frac{(34)^2}{187} + \frac{(44)^2}{80} = 40.4748$, and

(ii) $R = \frac{12.5691}{77} + \frac{180.3600}{292} + \frac{40.4748}{123} = 1.11$.

Hence (iii) $\chi^2$ = 492 (1.11 - 1) = 54.12.
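The two routes to $\chi^2$ in Example 17.17 can be compared with a short Python sketch. The cell counts are the ones that appear in the $T_j$ computations above (row totals 124, 101, 187, 80; column totals 77, 292, 123); the function names are illustrative only and are not part of the text.

```python
# Observed counts for Example 17.17, taken from the T_j computations.
# Rows are the hobbies (row totals 124, 101, 187, 80); columns are the degrees.
observed = [
    [24, 83, 17],
    [11, 62, 28],
    [32, 121, 34],
    [10, 26, 44],
]


def chi_square_general(table):
    """Chi-square from the definition: sum of (o - e)^2 / e over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency e_ij
            chi2 += (o - e) ** 2 / e
    return chi2


def chi_square_shortcut(table):
    """Shortcut: T_j = sum_i o_ij^2 / A_i, R = sum_j T_j / B_j, chi2 = (R - 1) n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    r = 0.0
    for j, b_j in enumerate(col_totals):
        t_j = sum(table[i][j] ** 2 / row_totals[i] for i in range(len(table)))
        r += t_j / b_j
    return (r - 1.0) * n


def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


if __name__ == "__main__":
    print(chi_square_general(observed))
    print(chi_square_shortcut(observed))
    print(degrees_of_freedom(observed))
```

Without intermediate rounding both functions give the same value, about 54.1. The printed figures 54.06 and 54.12 differ from it only because the expected frequencies and R were rounded before use.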
Example 17.18. Show that in a 2 × 2 contingency table wherein the frequencies are $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the value of $\chi^2$ calculated on the hypothesis of independence is given by

$$\chi^2 = \frac{(a+b+c+d)(ad-bc)^2}{(a+b)(c+d)(b+d)(a+c)}.$$

(P.U., M.A. 1966, B.A. Hons. Econ., 1969)

We are given the following 2 × 2 contingency table:

              a        b       a + b
              c        d       c + d
Total       a + c    b + d    a + b + c + d

Under the hypothesis of independence, we calculate the expected frequencies $e_{ij}$ (i, j = 1, 2) as below, writing n = a + b + c + d:

$$e_{11} = \frac{(a+c)(a+b)}{n}, \quad e_{12} = \frac{(b+d)(a+b)}{n}, \quad e_{21} = \frac{(a+c)(c+d)}{n}, \quad e_{22} = \frac{(b+d)(c+d)}{n}.$$

Hence

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(o_{ij}-e_{ij})^2}{e_{ij}} = \frac{\left[a - \frac{(a+c)(a+b)}{n}\right]^2}{\frac{(a+c)(a+b)}{n}} + \frac{\left[b - \frac{(b+d)(a+b)}{n}\right]^2}{\frac{(b+d)(a+b)}{n}} + \frac{\left[c - \frac{(a+c)(c+d)}{n}\right]^2}{\frac{(a+c)(c+d)}{n}} + \frac{\left[d - \frac{(b+d)(c+d)}{n}\right]^2}{\frac{(b+d)(c+d)}{n}}.$$

Each numerator reduces to $\frac{(ad-bc)^2}{n^2}$, since, for example, $a - \frac{(a+c)(a+b)}{n} = \frac{ad-bc}{n}$. Therefore

$$\chi^2 = \frac{(ad-bc)^2}{n}\left[\frac{1}{(a+c)(a+b)} + \frac{1}{(b+d)(a+b)} + \frac{1}{(a+c)(c+d)} + \frac{1}{(b+d)(c+d)}\right]$$

$$= \frac{(ad-bc)^2}{n} \cdot \frac{(a+b+c+d)^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{(ad-bc)^2 (a+b+c+d)}{(a+c)(b+d)(a+b)(c+d)}.$$

This is a special case of $\chi^2$ given by $\chi^2 = \sum\sum \frac{(o_{ij}-e_{ij})^2}{e_{ij}}$ in the general contingency table. Hence this is a short-cut method of computing chi-square when a 2×2 contingency table is given.


Example 17.19. A random sample of 250 men and 250 women were polled as to their desire concerning the ownership of television sets. The following data resulted:

Classification           Men   Women   Total
Want television           80    120     200
Don't want television    170    130     300
Total                    250    250     500

Test the hypothesis that desire to own a television set is independent of sex at the 0.05 level of significance.
(i) We state our null and alternative hypotheses as

$H_0$: The two variables of classification are independent, and

$H_1$: The two variables of classification are not independent.

(ii) The significance level is set at $\alpha$ = 0.05.

(iii) The test-statistic to use is

$$\chi^2 = \sum_i \sum_j \frac{(o_{ij}-e_{ij})^2}{e_{ij}},$$

which for a 2×2 table becomes

$$\chi^2 = \frac{(ad-bc)^2 (a+b+c+d)}{(a+c)(b+d)(a+b)(c+d)}.$$

This statistic, if $H_0$ is true, has an approximate chi-square distribution with 1 degree of freedom.

(iv) Computations. Substituting the values in the formula, we get

$$\chi^2 = \frac{500\,(80 \times 130 - 120 \times 170)^2}{250 \times 250 \times 200 \times 300} = \frac{40}{3} = 13.33.$$

(v) The critical region is $\chi^2 > \chi^2_{0.05,(1)}$ = 3.84.

(vi) Conclusion. Since the calculated value of $\chi^2$ = 13.33 falls in the critical region, we reject $H_0$ and conclude that desire to own a television set and sex are associated.
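As a check on the short-cut formula, the following Python sketch evaluates $\chi^2$ for Example 17.19 both from the general definition and from the 2×2 formula. The cell values a = 80, b = 120, c = 170, d = 130 are those substituted in step (iv); the function names and the constant holding the table value 3.84 are illustrative assumptions, not notation from the text.

```python
def chi_square_2x2_shortcut(a, b, c, d):
    """Short-cut formula: n (ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_2x2_general(a, b, c, d):
    """General definition: sum of (o - e)^2 / e over the four cells."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    cells = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (cells[i][j] - e) ** 2 / e
    return chi2


# Critical value quoted in the example for alpha = 0.05 and 1 d.f.
CRITICAL_0_05_DF1 = 3.84

if __name__ == "__main__":
    chi2 = chi_square_2x2_shortcut(80, 120, 170, 130)
    print(round(chi2, 2), chi2 > CRITICAL_0_05_DF1)
```

Both functions return 40/3 for these counts, which agrees with the value 13.33 obtained above.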


17.7.1. Co-efficient of Contingency for an r×c contingency table. The chi-square statistic shows only whether the sample data do or do not conform to the hypothesis. It does not tell anything about the strength of the association, which we sometimes desire to measure. For this purpose, Karl Pearson (1857-1936) has defined a co-efficient C, known as the Pearson's co-efficient of mean-square contingency, by the relation

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}},$$

where n indicates sample size. C measures the strength of the association or dependence between the criteria of classification of a contingency table. When there is complete independence, C = 0. When the two classifications are perfectly associated, $C = \sqrt{(k-1)/k}$, where k is the smaller of r and c. Thus C lies between 0 and $\sqrt{(k-1)/k}$. In a 2×3 contingency table, the maximum value of C is $\sqrt{(2-1)/2}$ = 0.707. The larger the value of C, the stronger is the association or dependence.

Another measure, known as Cramér's co-efficient of contingency, is defined as

$$Q = \sqrt{\frac{\chi^2}{n(k-1)}},$$

where n denotes the total sample size and k is the smaller of r and c. If the variables are completely independent, Q = 0 and Q = 1 when there is perfect relationship.

17.7.2. Yates' Correction for Continuity. In applying the $\chi^2$ approximation, we are required to combine the smaller frequencies (less than 5) with larger ones. But in case of two classes only, we cannot pool the smaller frequency into the larger one. For such a situation, Frank Yates in 1934 showed that the $\chi^2$-approximation is markedly improved if we use the following formula:

$$\chi^2 = \sum \frac{\left(|o_i - e_i| - \frac{1}{2}\right)^2}{e_i}.$$

This adjustment is known as Yates' correction for continuity. It should be used only when there is one degree of freedom and one frequency is small.

We also know that the distribution in a contingency table is necessarily discrete but the $\chi^2$-distribution is essentially continuous. The approximation to $\chi^2$ is just like the approximation of the discrete binomial distribution to the normal distribution, where a correction for continuity has already been discussed. So in a 2×2 contingency table with small frequencies, the cell frequencies are adjusted by adding $\frac{1}{2}$ to the smaller and subtracting $\frac{1}{2}$ from the larger frequencies and keeping the marginal totals unaltered. With this adjustment, the formula for $\chi^2$ becomes

$$\chi^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)},$$

where n = a + b + c + d.

This correction should be used if any expected frequency in a 2×2 table is less than 10.

17.7.3. An Exact Test for a 2×2 Contingency Table. When the frequencies in a 2×2 contingency table are small, an exact test, often called the Fisher's exact test, may be used. The exact probability of observing a given table, when the marginal frequencies are fixed, is given by

$$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}.$$

Assuming that d is the smallest frequency, we obtain other possible tables by decreasing d by unity, rearranging other cell frequencies and repeating the process till d becomes zero. We calculate the probabilities for the observed table and other possible tables for all values of d from the observed value down to zero. Then the total probability P corresponds to one tail of the distribution and is comparable with half the probability calculated from $\chi^2$. Thus for a two-sided test, we double the probability so obtained. If the doubled probability is negligible, we reject our hypothesis of independence.

p 


Example 17.20. Use the Fisher's exact test to test the hypothesis that inoculation is independent of immunity from attack among a population exposed to a certain disease, given the following data:


Classes 


Not attacked 


Attacked 
We calculate the exact probability of observing the above contingency table, using

$$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}.$$

The smallest frequency is d = 2. Therefore the probability for d = 2, denoted by $P_2$, is

$$P_2 = \frac{8!\,12!\,13!\,7!}{3!\,5!\,10!\,2!\,20!} = 0.0477.$$

As the range of variation of d is from 0 to 2, therefore the other two possible 2×2 tables are

   2    6  |   8             1    7  |   8
  11    1  |  12     and    12    0  |  12
  13    7  |  20            13    7  |  20

Thus the probabilities of these tables for d = 1 and d = 0 are

$$P_1 = \frac{8!\,12!\,13!\,7!}{2!\,6!\,11!\,1!\,20!} = 0.0043, \quad \text{and}$$

$$P_0 = \frac{8!\,12!\,13!\,7!}{1!\,7!\,12!\,0!\,20!} = 0.0001.$$

The total probability P = 0.0477 + 0.0043 + 0.0001 = 0.0521.

2P = 2(0.052) = 0.104, which is not negligible.

Hence we do not reject the hypothesis of independence.
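The tail probability in Example 17.20 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the cell counts a = 3, b = 5, c = 10, d = 2 are those implied by the factorials in $P_2$, d is taken to be the smallest frequency as in the text, and `math.comb` (Python 3.8 or later) is used in place of the explicit factorials. The function names are illustrative.

```python
from math import comb


def table_probability(a, b, c, d):
    """Exact probability of one 2x2 table with all marginal totals fixed.

    Equivalent to (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!).
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def one_tail_probability(a, b, c, d):
    """Add the probabilities as d is reduced to zero with the margins fixed.

    Assumes d is the smallest cell frequency, as in the text.
    """
    total = 0.0
    while d >= 0:
        total += table_probability(a, b, c, d)
        # Next table: lower d by one and rebalance the other three cells.
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total


if __name__ == "__main__":
    p = one_tail_probability(3, 5, 10, 2)
    print(round(p, 4), round(2 * p, 3))
```

The one-tail sum agrees with the value 0.0521 found above, and doubling it gives the two-sided probability 0.104.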


17.8 TESTING HYPOTHESIS ABOUT EQUALITY OF SEVERAL PROPORTIONS

Contingency tables can also be used to test the hypothesis about the equality of several proportions (binomial parameters). Suppose we draw k (k > 2) independent random samples from k dichotomous (binomial) populations, where the ith sample contains $n_i$ observations, of which some show a certain characteristic, say A. Let the proportion of characteristic A be $p_i$ (cell proportion to column total) for i = 1, 2, ..., k. The data form a 2×k contingency table.

The hypotheses we wish to test can be stated as

$H_0$: $p_1 = p_2 = \ldots = p_k$, i.e. the proportions in row A are equal (or the k samples are drawn randomly from populations with the same proportions); and

$H_1$: at least two of the proportions are not equal.

The test-statistic is

$$\chi^2 = \sum \frac{(\text{observed } f - \text{expected } f)^2}{\text{expected } f},$$

the summation being over all the 2k cells of the table. Under the null hypothesis, the statistic is distributed approximately as a $\chi^2$-distribution with (2-1)(k-1) = (k-1) degrees of freedom.

A 2×2 contingency table may also be used to test the equality of two proportions $p_1$ and $p_2$, i.e. the hypothesis $H_0$: $p_1 = p_2$ against $H_1$: $p_1 \neq p_2$, by performing the $\chi^2$-test. An alternative formula is the

$$Z^2 = \frac{(\hat{p}_1 - \hat{p}_2)^2}{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},$$

where $\hat{p}_1 = \frac{a}{a+b}$ and $\hat{p}_2 = \frac{c}{c+d}$ are the estimates of $p_1$ and $p_2$, $\hat{p}$ is the combined estimate, $\hat{q} = 1 - \hat{p}$, $n_1 = a + b$ and $n_2 = c + d$. It is interesting to note that $Z^2$ is exactly $\chi^2_{(1)}$ (see Exercise 17.51).

Example 17.21. From the adult male population of seven large cities, random samples of sizes noted below were taken, and the numbers of married and single men recorded:

Married
Single

Test the hypothesis, at the 0.05 level of significance, that the proportions of married men are the same in all the 7 cities.


The test is carried out as below:

(i) We set up our null and alternative hypotheses as

$H_0$: The proportions $p_i$ of married men in all seven cities are the same, and

$H_1$: The proportions $p_i$ of married men in at least two cities are not the same.

(ii) The significance level is set at $\alpha$ = 0.05.

(iii) The test statistic to be used is

$$\chi^2 = \sum \frac{(o_i - e_i)^2}{e_i},$$

where the summation is over all the 14 cells of the table and which, if $H_0$ is true, has an approximate $\chi^2$-distribution with (7-1), i.e. 6 d.f.

(iv) Computations. We calculate the expected frequencies under $H_0$ for each cell by the formula

$$e = \frac{(\text{Row Total})(\text{Column Total})}{n}.$$

Thus the expected frequency for married men in city A is (169×980)/1274 = 130. The other expected frequencies are calculated in a similar way. The expected frequencies for the cells in the table are given as follows:

City        A     B     C     D     E     F     G    Total
Married    130   170   150   110   160   120   140    980
Single      39    51    45    33    48    36    42    294
Total      169   221   195   143   208   156   182   1274

Now $\chi^2 = \sum \frac{(o_i - e_i)^2}{e_i}$, the summation being over the 14 cells,

$$= \frac{(133-130)^2}{130} + \frac{(164-170)^2}{170} + \ldots + \frac{(33-36)^2}{36} + \frac{(36-42)^2}{42} = 5.34.$$

(v) The critical region is $\chi^2 > \chi^2_{0.05,(6)}$ = 12.59.

(vi) Conclusion. Since the calculated value of $\chi^2$ = 5.34 does not fall in the critical region, we are unable to reject $H_0$. In other words, so far as this test is concerned, there is no evidence against the null hypothesis.
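The remark in Section 17.8 that $Z^2$ equals $\chi^2$ with one degree of freedom can be checked numerically. The Python sketch below assumes the usual pooled-proportion form of Z, with $\hat{p}_1 = a/(a+b)$, $\hat{p}_2 = c/(c+d)$ and pooled estimate $\hat{p} = (a+c)/(n_1+n_2)$. The cell counts are hypothetical and chosen only for illustration; the function names are not from the text.

```python
def z_squared(a, b, c, d):
    """Square of the two-sample Z statistic for proportions (pooled form)."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    p = (a + c) / (n1 + n2)  # combined (pooled) estimate
    q = 1 - p
    return (p1 - p2) ** 2 / (p * q * (1 / n1 + 1 / n2))


def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2x2 table from the short-cut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    # Hypothetical counts: rows are the two samples, the first column
    # counts the members showing characteristic A.
    a, b, c, d = 30, 20, 18, 32
    print(z_squared(a, b, c, d), chi_square_2x2(a, b, c, d))
```

The two printed values agree up to floating-point error, so the Z-test and the $\chi^2$-test lead to the same two-sided decision for a 2×2 table.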


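17.9 THE CHI-SQUARE TEST AS A TEST OF HOMOGENEITY

The chi-square statistic can also be used when the rows of a table, which looks like a contingency table, each represent a different sample or population, and we wish to know whether the samples are homogeneous, the word homogeneous being used to indicate the same or equal. The $\chi^2$-test applied in such a situation is called a test of homogeneity.

When there are two random samples, a simpler method proposed by Brandt and Snedecor is used to calculate the value of $\chi^2$. This method is developed as below:

Suppose we draw two independent random samples from a population and we wish to test whether the two samples are homogeneous. Let the values of the two samples be presented in the following 2×n table:

Sample 1    a_1   a_2   ...   a_n  |  A
Sample 2    b_1   b_2   ...   b_n  |  B
Total       c_1   c_2   ...   c_n  |  N

The observed values in the two samples are $a_i$ and $b_i$, and the expected values $e_i$ in the first sample are given by $e_i = \frac{A c_i}{N}$. The samples are independent, therefore the expected values $e_i$ in the second sample are $\frac{B c_i}{N}$. Now

$$\chi^2 = \sum \frac{\left(a_i - \frac{A c_i}{N}\right)^2}{\frac{A c_i}{N}} + \sum \frac{\left(b_i - \frac{B c_i}{N}\right)^2}{\frac{B c_i}{N}}.$$

Since $a_i + b_i = c_i$ and A + B = N, we have $b_i - \frac{B c_i}{N} = -\left(a_i - \frac{A c_i}{N}\right)$. Thus

$$\chi^2 = \sum \left(a_i - \frac{A c_i}{N}\right)^2 \frac{N}{c_i}\left(\frac{1}{A} + \frac{1}{B}\right) = \frac{N^2}{AB} \sum \frac{1}{c_i}\left(a_i - \frac{A c_i}{N}\right)^2$$

$$= \frac{N^2}{AB}\left[\sum \frac{a_i^2}{c_i} - \frac{2A}{N}\sum a_i + \frac{A^2}{N^2}\sum c_i\right] = \frac{N^2}{AB}\left[\sum \frac{a_i^2}{c_i} - \frac{A^2}{N}\right] \quad (\because \sum c_i = N, \; \sum a_i = A),$$

where either row can be chosen as the $a_i$'s.

This is known as Brandt-Snedecor formula with n-1 degrees of freedom. The rest of the procedure is the same.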


Example 17.22. In a certain community, a random sample of 50 men and another sample of 50 women over 21 years of age were asked about their educational background, classified as junior high, senior high, or college. The results are:

            Junior High   Senior High   College   Total
Men              13            25          12       50
Women            23            20           7       50
Total            36            45          19      100

(i) We state our null and alternative hypotheses as

$H_0$: The two samples are homogeneous, and

$H_1$: The two samples are not homogeneous.

(ii) The significance level is set at $\alpha$ = 0.05.

(iii) The test-statistic to use is the Brandt-Snedecor formula

$$\chi^2 = \frac{N^2}{AB}\left[\sum \frac{a_i^2}{c_i} - \frac{A^2}{N}\right],$$

which, if $H_0$ is true, has an approximate $\chi^2$-distribution with n - 1 = 2 degrees of freedom.

(iv) The critical region is $\chi^2 > \chi^2_{0.05,(2)}$ = 5.99 ($\because$ n = 3).

(v) Computations. We calculate the value of $\chi^2$ as follows:

$$\chi^2 = \frac{N^2}{AB}\left[\sum \frac{a_i^2}{c_i} - \frac{A^2}{N}\right] = \frac{(100)^2}{(50)(50)}\left[\frac{(13)^2}{36} + \frac{(25)^2}{45} + \frac{(12)^2}{19} - \frac{(50)^2}{100}\right]$$

= 4 (4.69 + 13.89 + 7.58 - 25) = 4 × 1.16 = 4.64.

(vi) Conclusion. Since the calculated value of $\chi^2$ = 4.64 does not fall in the critical region, we are unable to reject $H_0$. We may conclude that the two samples (groups) are homogeneous in respect of educational levels.
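The Brandt-Snedecor computation for Example 17.22 can be sketched in Python. The first row (13, 25, 12) and the column totals (36, 45, 19) are those appearing in the computation above; the second row (23, 20, 7) is obtained here by subtraction from the column totals, and the function name is illustrative only.

```python
def brandt_snedecor(a_row, b_row):
    """Chi-square for two samples: (N^2 / AB) [sum(a_i^2 / c_i) - A^2 / N]."""
    big_a, big_b = sum(a_row), sum(b_row)
    n_total = big_a + big_b
    # c_i = a_i + b_i is the total of the ith column.
    weighted = sum(a * a / (a + b) for a, b in zip(a_row, b_row))
    return n_total ** 2 / (big_a * big_b) * (weighted - big_a ** 2 / n_total)


if __name__ == "__main__":
    first = [13, 25, 12]   # a_i, as used in the worked computation
    second = [23, 20, 7]   # b_i = c_i - a_i with c = (36, 45, 19)
    print(brandt_snedecor(first, second))
    print(brandt_snedecor(second, first))  # either row may serve as the a_i
```

Carried to more decimal places the statistic is about 4.65, against 4.64 in the text where the three quotients were rounded first. Interchanging the rows leaves the value unchanged, and either figure is below the critical value 5.99.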


EXERCISES

17.1 (a) Define a chi-square random variable and its density function.

(b) Discuss the important properties of $\chi^2$-distribution.

17.2 (a) Find the distribution of the sum of squares of n independent random variables, each of which is distributed normally with zero mean and unit variance.

(b) Show that the $\chi^2$-distribution tends to normal distribution for large degrees of freedom. (P.U., M.Sc. 1979)


17.3 Show that $\frac{1}{\sigma^2}\sum (X_i - \mu)^2$ is distributed as $\chi^2$ with n degrees of freedom. Explain why $\frac{1}{\sigma^2}\sum (X_i - \bar{X})^2$ is distributed as $\chi^2$ with (n-1) degrees of freedom.

17.4 (a) Show that for large n, $\sqrt{2\chi^2}$ is normally distributed about mean $\sqrt{2n - 1}$ and with variance unity.

(b) Compute $\chi^2_{0.05}$ for 40, 60 and 105 degrees of freedom by (i) Fisher's approximation (ii) Wilson-Hilferty approximation and compare the values with the table values.

17.5 (a) Explain how you determine a confidence interval estimate of $\sigma^2$ of a normal population.

(b) Given that X is normally distributed and given the sample values $\bar{x}$ = 42, S = 5 and n = 20. Find the 98 per cent confidence interval for $\sigma^2$.
, EES so 
17.6 (a) The following are the volumes, in deciliters, of 10 cans of peaches distributed by a certain company: 46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2 and 46.0. Find a 95% confidence interval for the variance of all such cans of peaches distributed by this company, assuming volume to be a normally distributed variable. (I.U., M.Sc., 1986)

(b) The contents of 10 similar containers of a commercial soap are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3 and 9.8 litres. Find a 95% confidence interval for the variance of all such containers, assuming an approximately normal distribution.


Sample Size: 3, 5, 7
Sample Mean: 42, 45, 40
Sample Variance: 25, 16, 9

Obtain a 95% confidence interval for $\sigma^2$.

Obtain a pooled estimate of $\sigma^2$ and use it to find 90 per cent confidence limits for $\sigma^2$.

(M.Sc., P.U., 1989, I.U., 1993)


17.9 (a) Explain how you would test the hypothesis about variance of a normal population.

Test the hypothesis that the process variance is 4 (inches)² at the 5 per cent level.

17.10 (a) A sample of 25 observations has $S^2$ = 12.6. Would you accept or reject at the 5% level of significance the hypothesis that $\sigma^2$ = 20? Also compute a 90% confidence interval for $\sigma^2$. (P.U., B.A/B.Sc. 1992)

(b) A standard examination has been given for several years with $\mu$ = 70 and $\sigma^2$ = 9. A school using this examination for the first time gave it to a group of 25 students who obtained a mean $\bar{X}$ = 71 and a variance of $S^2$ = 12. Is there reason to doubt that the score of all students in the school would have a variance of 9?


17.11 (a) A random sample of 15 has the following values:

10.21, 9.72, 10.13, 8.89, 10.20, 9.65, 10.02, 10.00, 9.45, 10.11, 8.97, 10.21, 9.36, 9.55, 10.23. Test the hypothesis that $\sigma^2$ = 0.12 against $\sigma^2$ > 0.12 at (i) 5% and (ii) 1% level of significance. (P.U., B.A./B.Sc. 1989)

(b) A distribution has a variance $\sigma^2$ of (2.792)². Do the following 10 values, selected at random, have a greater variance than expected? 67.50, 70.75, 72.90, 63.25, 65.25, 68.75, 69.25, 68.50, 66.50 and 64.75. (I.U., M.Sc., 1986)

17.12 (a) The weights of a random sample of 10 boxes of a particular ... 13.7, 14.1, 14.3, 14.1, 13.8, 14.4, 14.8, 13.9 and 14.3. Test the hypothesis that $H_0$: $\sigma^2$ = 0.02 against the alternative $H_1$: $\sigma^2$ < 0.02, using a 0.01 level of significance.

(b) A manufacturer of car batteries claims that the life of his batteries has a standard deviation equal to 0.9 years. If a random sample of 10 of these batteries has a standard deviation of 1.2 years, do you think that $\sigma$ > 0.9 years? Use a 0.05 level of significance.
17.13 (a) Describe how you would test the equality of k (k > 2) variances of normal populations. (P.U., M.Sc. 1989)

(b) Show that the estimates 3.8, 4.4, 8.1, 6.1 and 9.4 of the population variance, based on 5, 8, 6, 7 and 4 d.f. respectively, may be regarded as homogeneous.

17.14 (a) Describe Bartlett's test for homogeneity of variances.

(b) Three independent samples gave the following results:

Observations

34, 40, 47, 60, 84.

40, 59, 60, 67, 86, 92, 95, 98, 108.

46, 93, 100.

Use Bartlett's test to test the hypothesis of equal variances. Let $\alpha$ = 0.05. (I.U., M.Sc. 1994)

17.15 (a) Six samples of size 5 each have the variances 10.4, 13.8, 11.7, 19.3, 16.4 and 15.8. Test the hypothesis of homogeneity of variances by Bartlett's test. (P.U., M.Sc. 1988)

(b) For the data given below:

Sample 1: 4, 7, 6, 6;
Sample 2: 5, 1, 3, 5, 3, 4;
Sample 3: 3, 8, 6, 8, 9, 5.

Use Bartlett's test to test the hypothesis that the variances of three populations are equal. ($\alpha$ = 0.05) (P.U. M.Sc. 1995)




17.16. A random selection of nin 
out patients clinics across t 
and variances of 10 s 
amples were 9 
and 26. Test the hypothesis that ai : 


17.17 n objects are classified independently into k groups, the probability for the ith group being $p_i$ and the number falling in it being $n_i$ (i = 1, 2, ..., k). What is the joint distribution of the $n_i$? Show that as n tends to infinity,

$$\sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i}$$

tends to a $\chi^2$-distribution with (k-1) degrees of freedom.
(P.U., M.A. Stat., 1964)


(a) Describe three dist; 
istinct use 
According t : S of the chi- re distri 
should p a es the proportion in neon 
i * 4pq : q? wher in three groups 
consistent with the sample 9 Mi, Jove q = 1. Are the dats 
ESE) P = 0.4? 


(P.U., B.A./B.Se, 1992) 


17.18 


(b) The proportion of indi 
of individuals i 
pees be in the Proportion ae 
where p+q+r= i ‘ 
nr pa : ; 1, Given the observed frequenci 
j or compatibility with p=0.4 q=0.4 i, #60, 
“4, q=0.4 and r=0.2 


17.19 (a) Genetic theory states 
blood t 
ype M and the other of blood type N will always be of 


vaada of eh of type M, 42% of type MN and the 
the truth e type N. The low value of x2 demonstrates 
of the genetic theory." Calculate thre value of x2, - 


make the appropria S 
late test of i i he 
i kit RPT ont 
sions quoted F Us B.A. B «3G 1973) 


(b) A machine is supposed to mix peanuts, hazelnuts, cashews, and pecans in the ratio 5 : 2 : 2 : 1. A can containing 500 of these mixed nuts was found to have 269 peanuts, 112 hazelnuts, 74 cashews, and 45 pecans. At the 0.05 level of significance, test the hypothesis that the machine is mixing the nuts in the supposed ratio. (P.U., B.A/B.Sc. 1988)


17.20 A thousand individuals were classified according to sex and according to whether or not they were colour-blind as follows:

Classes

Normal

Colour Blind

According to the genetic theory, the frequencies in four classes


‘should be Pop ane 
4% 1% 
Test the hypothesis that the data are consistent with theory.


17.21 (a) In 200 tosses of a coin, 115 heads and 85 tails were observed. Test the hypothesis that the coin is fair, using a level of significance of 0.05.

(b) In 360 tosses of a pair of dice, 74 "sevens" and 24 "elevens" are observed. Using a 0.05 level of significance, test the hypothesis that the dice are fair.

17.22 (a) The sex distribution of 98 births reported in a newspaper was 52 boys and 46 girls. Is this consistent with an equal sex division in the population? Use the $\chi^2$-test and the normal approximation.

(b) In a certain disease with 40% mortality, of 10 patients given a certain treatment only one dies. Is the treatment effective at 5% level of significance? (P.U., M.Sc.)


17.23 (a) The following table records the observed number of births at a hospital in four consecutive quarterly periods.

                    Jan.-Mar.   Apr.-Jun.   Jul.-Sep.   Oct.-Dec.
Number of births

It is hypothesized that twice as many babies are born during the Jan.-Mar. quarter than are born in any of the other three quarters. At $\alpha$ = 0.10, test if these data strongly contradict the stated hypothesis.


(b) The grades in a statistics course were as follows:

Test the hypothesis, at the 0.05 level of significance, that the distribution of grades is uniform.

17.24 (a) A random number table of 250 digits showed the following distribution of the digits 0, 1, 2, ..., 9:

Digit:       0    1    2    3    4    5    6    7    8    9
Frequency:  17   31   29   18   14   20   35   30   20   36

Test the hypothesis, at 0.05 level of significance, that the digits were distributed in equal numbers in the table.

(b) The following distribution shows deaths from overdoses of narcotics by age-group. Use chi-square statistic to test the hypothesis that equal numbers die in all age-groups. (I.U., M.Sc. 1986)


17.25 (a) A die is tossed 180 times with the following results:

Is this a balanced die? Use $\alpha$ = 0.01. (P.U., B.A./B.Sc. 1990)

(b) The following figures show the number of births in an area over a year by months of occurrence.

January    50759    May      51371    September  52142
February   46472    June     47388    October    50824
March      51419    July     49995    November   47768
April      49670    August   51043    December   51129

Use the $\chi^2$ test to discuss whether there is any seasonality in births revealed by these data. (P.U., B.A./B.Sc. 1963)


17.26 (a) Discuss the $\chi^2$-tests of goodness-of-fit. What are the assumptions in the application of these tests to practical problems?

(b) Records taken of the number of male and female births in 800 families having four children are as follows:

290

178

Test whether the data are consistent with the hypothesis that the binomial law holds and that the chance of a male birth is equal to that of a female birth, that is $p = q = \frac{1}{2}$. (P.U., B.A/B.Sc. 1976, 93)

17.27 (a) What is the purpose of the goodness-of-fit test? Describe three situations where this test might be used appropriately.

(b) Three six-sided dice were thrown 648 times and the number of 5's or 6's noted at each throw.

Number of 5's or 6's

Number of throws

Test the hypothesis that the data conform to binomial distribution with $p = \frac{1}{3}$ and n = 3. Let $\alpha$ = 0.05. (P.U., B.A./B.Sc. 1983)

17.28 Twelve dice were thrown 4096 times and a throw of 6 was reckoned as a success. The observed frequencies were as given below:

No. of successes   0   1   2   3   4   5   6   7 & over   Total
Frequency   447   1145   1181   196   8   4096

Find the value of $\chi^2$ on the hypothesis that the dice were unbiased and hence show that the data are consistent with the hypothesis so far as the $\chi^2$-test is concerned. (P.U., B.A./B.Sc. 1975)

17.29 (a) Suppose that 6 coins are tossed simultaneously 640 times and the following frequency distribution is observed:

No. of heads

Frequency   18   70   137   210   145

Test the null hypothesis that the coins are well-balanced. Use $\alpha$ = 0.01. (P.U., B.A./B.Sc. 1992)

(b) When the first proof of a book containing 250 pages was read, the following distribution of printing mistakes was found:

No. of mistakes per page

Frequency

Fit an appropriate distribution to the data and test the goodness-of-fit.


17.30 Given the data

Fit a Poisson distribution and test the goodness-of-fit. (P.U., B.A/B.Sc. 1982)

17.31 Test whether the data given below may be regarded as conforming to a Poisson Distribution?

x

365   210   80   23   9   2

17.32 The wages of 1,000 employees range from Rs. 4.50 to Rs. 19.50. They are grouped in 15 classes with a common class interval of Re. 1, and the class-frequencies, from the lowest class to the highest, are 6, 17, 35, 48, 65, 90, 131, 173, 155, 117, 75, 52, 21, 9, 6. Fit a normal distribution and apply the chi-square goodness-of-fit test.

17.33 The heights of 200 employees are distributed as follows:

Test whether a normal distribution gives a satisfactory fit to the data at $\alpha$ = 0.05.

17.34 (a) Define attributes and ultimate class-frequencies.

(b) Compute the ultimate class-frequencies from the data given below:
N = 500, (ABC) = 240, (aPy) = 25, (AB) = 18, (By) = 98, 
(Y) = 125, (A)-(a) = 80 and (B) - (B) = 200. 
17.35 In a class of statistics, there were 300 students. Their results in the First Terminal, Second Terminal and the Annual Examinations were as follows:

"120 passed the first terminal, 112 passed the second terminal and 144 passed the annual examination, 38 passed all the three, 39 failed in all the three, 44 passed the first two and failed in the annual, 63 failed in the first two but passed the annual examination."

Find how many students passed at least two examinations?

17.36 (a) Define the term Consistence. State the necessary and sufficient conditions for the consistence of a set of class-frequencies.

(b) Certain data obtained from a study of a group of 1,000 subscribers to a certain magazine relating to their sex, marital status and education were reported as follows: 312 males, 470 married, 525 college graduates, 42 male college graduates, 147 married college graduates, 86 married males and 25 married male college graduates.

Show that the numbers reported in the various groups are not consistent.

17.37 When are two attributes said to be independent, positively associated or negatively associated? Discuss the association when (AB) = 256, ($\alpha$B) = 768, (A$\beta$) = 48, ($\alpha\beta$) = 144. (P.U. B.A/B.Sc. 1917)

17.38 (a) When are two attributes independent? Describe different forms of the Criterion of Independence.

(b) Show as briefly as possible whether A and B are independent, positively associated or negatively associated in the following cases:

(i) N = 5000, (A) = 2350, (B) = 3100, (AB) = 1600;

(ii) (A) = 490, (AB) = 294, ($\alpha$) = 570, ($\alpha$B) = 380;

(iii) (AB) = 256, ($\alpha$B) = 768, (A$\beta$) = 48, ($\alpha\beta$) = 144.

(P.U., M.A. Econ. 1974)

17.39 (a) What is meant by Association of Attributes? How is it measured?

(b) Prove that $Q = \frac{2Y}{1 + Y^2}$, where Q and Y are the co-efficients of association and colligation respectively.

Solution. (b) By definition

$$Y = \frac{1 - \sqrt{\frac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}}{1 + \sqrt{\frac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}} = \frac{\sqrt{(AB)(\alpha\beta)} - \sqrt{(A\beta)(\alpha B)}}{\sqrt{(AB)(\alpha\beta)} + \sqrt{(A\beta)(\alpha B)}}.$$

Inverting, we get

$$\frac{1}{Y} = \frac{\sqrt{(AB)(\alpha\beta)} + \sqrt{(A\beta)(\alpha B)}}{\sqrt{(AB)(\alpha\beta)} - \sqrt{(A\beta)(\alpha B)}}, \quad \text{so that} \quad \frac{1+Y}{1-Y} = \frac{\sqrt{(AB)(\alpha\beta)}}{\sqrt{(A\beta)(\alpha B)}}.$$

Squaring, we obtain

$$\frac{(1+Y)^2}{(1-Y)^2} = \frac{(AB)(\alpha\beta)}{(A\beta)(\alpha B)}$$

or

$$\frac{(1+Y)^2 - (1-Y)^2}{(1+Y)^2 + (1-Y)^2} = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$

or

$$\frac{4Y}{2(1 + Y^2)} = Q, \quad \text{i.e.} \quad Q = \frac{2Y}{1 + Y^2}.$$

17.40 (a) Define a Contingency Table. How do you determine the number of degrees of freedom in an r×c contingency table?

(b) What is the chi-square test for independence? Describe the situations where this test might be used appropriately.

17.41 (a) State the important applications of $\chi^2$-statistic.

(b) The following table shows the number of recruits taking (i) a preliminary and (ii) a final test in car driving. Use the $\chi^2$ test to discover whether there is any association between the results of preliminary and those of the final test.

Preliminary

605    135
195     65
800    200    1000

17.42 (a) Test the association between inoculation against a disease and exemption from attack from the following contingency table:

Classes           Attacked   Not-attacked   Total
Inoculated          528
Not-inoculated
                    790


INTRODUCTION TO STATISTICAL THEopy THE CHI-SQUARE DISTRIBUTION AND STATISTICAL INFERENCE 
| ug and half of them A 233 
232 `= table gives the census data of orchards, Tes | paran ts he INER aa Elven sugar pills, The patients 
(b) The following that the two variables of classification are | Test the hypothesis that th a recorded in the following tabl 

the hypothesis | . e drug is no better than sug, ee 
independent. : TATE j curing colds. Let @=0.05. gar pills for 

gory Hel 

Yielders 205 = elped 
High Yielder i rug 52 
195 


17.46 A thousand households are taken at rando iyi : 
three groups A, B and C, according to the bas set a into 
The following table shows the numbers in each grou y as 

a=0.05. | colour television receiver, a black and white se having a 

television at all. iver, or no 


Find y? and test whether the two attributes are independent. Let 
17.43 Em 


. 


Colour television 


Black and white 
(P.U., B.A./B.Sc. 1960; Opt. 1969; 72) ` 


the null hypothesis that the two variables of 7 - — 
17.44 (a) pee ae " msentend using a 0.05 level of ` Test the hypothesis that there is no association between total 
ae a i income and television ownership. 
signifi k ; ; 


B 


1 
Bo 


(P.U., B.A/B.Sc. 1961; M.A. Econ. 1985) 


(b) The following is percentage distribution by income level and 
ownership of a random sample of 400 families in the city of 


| 
| 
| 
Lahore. , | 


None 


17.47 Calculate $\chi^2$ (chi-square) from the following contingency table of
attributes and test for independence at $\alpha = 0.01$.


Attributes 


(C.S.S., 1965) 


17.48 Gilby classified 1725 school children according to intelligence and
Less than Rs. 12,000 to More than apparent family economic level. A condensed classification 
Rs. 12,000 Rs. 60,000 Rs. 60,000 


| follows: ; 
-[Home Owner | , 
Renter 25% | ` 
j 


Very well.clothed 
Well clothed 
Poorly clothed 


: f th 
Test the null hypothesis of independence : 
classifications ut the 0.01 level of significance. 


4 
! 
| i —_ 
\ 


Income Level 


Test the hypothesis that the home ownership is independent 

of the family income level, using 1% level of significance. | 

(P U., M.Sc: 1989) | 

17.45 A certain drug is claimed to be effective in curing colds. In an
experiment on 164 people with colds, half of them were given


e two 


17.49 An insurance company wants to determine whether a policyholder's age is independent of whether or not the policyholder has filed an accident claim. A study of 1000 of its policyholders gave the following results:


Under 25, 25-40, 40-55, Over 55 


b = nop, 


na (I~ Pa)yry = np and ry = ng, 


a a ro 
: L= a oak so thata = nip, 
d 


c= ny (1 -p:), 


2 _ n (ad — bo)? 
Now X = “nynoryre 
Reported claim n [nynopy (1 — po) - No Do (1 -p))2 
i ny. ng. np.nq 


_ mP (1 Po) -p -p - Mite? — Po)? 


| No claim 


Test the hypothesis at 0.05 level of significance that the claim 
status is independent of the policyholder’s age. 


; : n 
17.50 A random sample of 200 married men, all retired, were classified A A a ngp 
according to education and number of children. Dı -= Dp»)? 
ee ce where n = n; + ng 
Number of children Ries. 

5 A n 

Education ra 

Pi-Po 

Elementary = Z2, where Z = ! 


Secondary 


That is, $\chi^2$ is the square of the Z-statistic for testing the equality of two proportions. Hence, in a 2x2 contingency table, the two test procedures are equivalent.
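The equivalence stated above can be checked numerically. The sketch below is only an illustration: the cell counts are hypothetical, and the layout assumed is the one used in the solution, with the two samples in the columns (column totals $n_1 = a + c$ and $n_2 = b + d$).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2x2 table: n(ad - bc)^2 / (n1 * n2 * r1 * r2)."""
    n = a + b + c + d
    n1, n2 = a + c, b + d          # column (sample) totals
    r1, r2 = a + b, c + d          # row totals
    return n * (a * d - b * c) ** 2 / (n1 * n2 * r1 * r2)

def z_two_proportions(a, b, c, d):
    """Z-statistic for H0: p1 = p2 using the pooled proportion."""
    n1, n2 = a + c, b + d
    p1, p2 = a / n1, b / n2
    p_hat = (a + b) / (n1 + n2)    # pooled estimate of the common proportion
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts, chosen only for illustration.
a, b, c, d = 30, 20, 70, 80
chi2 = chi_square_2x2(a, b, c, d)
z = z_two_proportions(a, b, c, d)
print(round(chi2, 4), round(z * z, 4))
```

Both printed values agree, which is what the algebra in the solution predicts for any 2x2 table with non-zero margins.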


17.52 Given the following contingency table for hair colour sini be 
colour, calculate the co-efficient of contingency and interp 


| 
| 
fference between the two proportions, | result. 


Test the hypothesis, at the 0.05 level of significance, that the size 
of a family is independent of the level of education attained by 
_ the father, (I.U. M.Sc. 1985, 91) 


| Iod Show that a chi-squared test for a 2x2 contingency table is 


equivalent to testing the di 
using the normal approximation. (LU., M.Sc. 1993, 94) | Hair Colour 

p , : ey rown 
Solution, Let the 2x2 contingency table be | Eye Colour Fair Grey Br 


Blue 
| Black 
Dark Blue 


(P.U., B.A./B.Sc. 1984) 


' correction for continuity. 
ynang 17.53 (a) Explain the use of Yates’ correction fol 


i ion, 255 had not 
(b) Out of a group of 320 people exposed to infection, ¢ 


p disease. Of 
been immunized, and of these 95 contracted the 


M1 4+ Ge =) ed, p it 
15 were infected. Does ! 


1060, ty e+ dandy = oak ead 
Suppose the two 


Proportions are denoted D D. ‘ been immunized, : inst infection? 
estimate of the population Proportion is da vat wean ane | ho ma 4 a gave any protection sep» atthe 
a aia y p. Then | ae ifference in the significance 0° © applied? 
Py= n,’ P2=—; 1 -p mA 1-3 d What is the differ , correction is or iS not, aoe | 
j A ng’ ' X? test according as Yates cO (P.U., B.A/B.Se. 1 


17.54 (a) Prove that $\chi^2$ for the table

    a    b
    c    d

is given by

$$\chi^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)},$$

if ad < bc and where n = a + b + c + d.
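As a quick check on the formula in part (a), the following sketch compares the closed form with the cell-by-cell definition of the continuity-corrected statistic, $\sum (|O - E| - 0.5)^2 / E$. The counts used are hypothetical and serve only as an illustration.

```python
def yates_closed_form(a, b, c, d):
    """n(|ad - bc| - n/2)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )

def yates_cellwise(a, b, c, d):
    """Sum over the four cells of (|O - E| - 0.5)^2 / E."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            total += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected
    return total

# Hypothetical 2x2 table, for illustration only.
print(yates_closed_form(12, 5, 7, 9), yates_cellwise(12, 5, 7, 9))
```

The two functions return the same value because, in a 2x2 table, every cell has the same absolute deviation $|O - E| = |ad - bc|/n$.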


(b) A random sample of 30 adults is classified according to sex and 
the number of hours they watch television during a week. 


9 
7 


Over 25 hours 


Under 25 hours 
Using a 0.01 level of significance, test the hypothesis that a 
person’s sex and time watching television are independent. 
(P.U., B.A/B.Sc. 1986-S) 
(a) Describe Fisher’s exact test for-a 2x2 contingency table. 


(b) Suppose that a number of patients were treated for cancer 
with results as in the following table: 


17.55 


Tumer Regression 
Toxity Present 


| Yes No | 
5 2 
1 7 
exact test to test the independence. 
(P.U., M.Sc. 1984, 86) 


ww after fractured neck of femur in a 
paedic ward (A) and a general ward (B) are 


Use Fisher's 


(c) Deaths in 6 mon 
Specialised ortho 
given below: 


Test the hypothesis of independence by using the Fisher-Irwin exact test.

17.56 (a) The following data are intended to show dependence of brittleness in polyethylene bars on the duration of heat treatment at a particular phase of the manufacturing process.


pT Brite Tough | 
Treatment 1 2 8 

Use the Fisher’s exact test to test the null hypothesis that the 
brittleness of polyethylene bars does not vary with the two heat 
treatments. : (P.U., M.Sc. 1989) 
(b) Use the Fisher’s exact test to test the hypothesis that 


inoculation is independent of immunity from attack among a 
population exposed to a certain disease from the following 


data: 
oo [| 2 O 
6 


(I.U., M.Sc. 1995) 


17.57 (a) In a study to determine whether or not the proportions of 
defectives produced by workers was the same for the day, 
evening, or night shift worked, the following data were 


collected: = 
Shift 
Evening Night 
- 55 70 
Defectives 390 870 


Non-defectives = 
Test the hypothesis, at the 0.025 level of significance, that the proportion of defectives is the same for all three shifts.

(b) There are 5 classes, each having 50 students. The results of these 5 classes are given below:


Pass 


Failure 
1 failure, usiiig totals. 


; 
4 ase: i 
Ho: 4P gency table are 


Test the null hypothesis Ho : sth conti 
17.58 The observed row frequencies 1m a tation n; = 4i t b; for 
: dba, bo «or On: Using the no i ac site Sill 
Oj, Ag, «+, Ap “m r A, Lb; = BandA + alia : 
all į (= 1; 2; a HD) 28 7 ressed as 
the x? for testing independence danie 




What are the degrees of freedom for the chi-square? 


Use this formula to find x? from the following data: 


No. of Rows 
of Kernels 


17.59 (a) Prove the Brandt-Snedecor formula for x2. (1.U., M.Sc. 1986) 


(b) Two groups of freshmen applying to enter a university took 
the same college aptitude test. The groups (A and B) differed 
in the type of high school education they had experience,’ 


The frequency distributions of scores for the two 
as follows: 


| 0-9 [10-19 20-29 | 30-39 | 40-49 
71°] 68 | oo | a7 
2 | 8 | i | qe 


groups were 


Calculate the value of $\chi^2$ and determine whether there is a significant difference in college aptitude test between the groups.
(I.U., M.Sc. 1985)


Teacher- 
instructed 


Machine- 
instructed 


Calculate $\chi^2$ and determine whether there is a significant difference in achievements of the two groups.
(M.Sc. P.U., 1989; I.U., 1992)


| 


The Student's t-Distribution
and Statistical Inference


18.1 INTRODUCTION

Earlier it was shown that if we take a random sample from a normal population with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$; that is, $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is a standard normal variable even if n, the sample size, is small, provided that $\sigma^2$ is known. But in actual practice, the population variance $\sigma^2$ is generally not known and is estimated from the sample data. When the sample size is small (n < 30) and $\sigma^2$ is replaced with its unbiased estimate $s^2$, the statistic

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

is no longer normally distributed. This being the case, how can we test hypotheses about the means and construct confidence intervals? This problem was solved by William Sealy Gosset (1876-1937), who in 1908 published a paper titled "The Probable Error of a Mean", in which he discussed the sampling distribution of $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$. W.S. Gosset, an employee of Messrs. Guinness, a Dublin brewery, published a series of scientific papers under the pen name "Student", as he was not permitted under the company policy to publish his research under his real name. The distribution of the statistic $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$, after his pen name (Student), is known as Student's t-distribution. Gosset did not complete the mathematical proof, but he reached the correct conclusion about the sampling distribution of the statistic t. A rigorous proof was provided by Sir R.A. Fisher (1890-1962) in 1925. The t-distribution has a single parameter known as the number of degrees of freedom. The t-distribution is of great importance in the so-called small sample tests and is widely used in statistical inference.


18.2 THE STUDENT'S t-DISTRIBUTION

Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with mean $\mu$ and unknown variance $\sigma^2$, and let $\bar{X} = \dfrac{1}{n}\sum X_i$ and $s^2 = \dfrac{1}{n-1}\sum (X_i - \bar{X})^2$, which is the unbiased estimate of $\sigma^2$. Then the sampling distribution of the statistic

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

is called Student's t-distribution with (n-1) degrees of freedom. It is interesting to note that the statistic t may be written as

$$t = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{(n-1)\,\sigma^2}}} = \frac{Z}{\sqrt{U/(n-1)}},$$

where $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is a standard normal random variable and $U = \dfrac{\sum (X_i - \bar{X})^2}{\sigma^2} = \dfrac{(n-1)s^2}{\sigma^2}$ is a $\chi^2$-random variable with (n-1) degrees of freedom. Thus the t-statistic is the quotient of a standard normal variable and the square root of a chi-square random variable divided by its degrees of freedom.


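To find the distribution of t, we proceed as follows:

Let Z be a standard normal random variable and U a $\chi^2$-random variable with $\nu$ (Greek letter nu) degrees of freedom. If Z and U are independent, their joint density will be

$$f(z, u) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \cdot \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, u^{\nu/2 - 1} e^{-u/2}, \qquad -\infty < z < \infty,\; 0 < u < \infty.$$

Let us put $t = \dfrac{z}{\sqrt{u/\nu}}$ and $w = u$, so that $z = t\sqrt{w/\nu}$ and $u = w$. Then the change of variables technique tells us that

$$f(t, w) = f(z, u)\,|J|,$$

where J is the Jacobian of the transformation. Thus

$$J = \begin{vmatrix} \dfrac{\partial z}{\partial t} & \dfrac{\partial z}{\partial w} \\[2mm] \dfrac{\partial u}{\partial t} & \dfrac{\partial u}{\partial w} \end{vmatrix} = \begin{vmatrix} \sqrt{w/\nu} & \dfrac{t}{2\sqrt{w\nu}} \\[2mm] 0 & 1 \end{vmatrix} = \sqrt{\frac{w}{\nu}}.$$

Substituting these values, we obtain the joint density of t and w as

$$f(t, w) = \frac{1}{\sqrt{2\pi\nu}\; 2^{\nu/2}\,\Gamma(\nu/2)}\; w^{(\nu+1)/2 - 1}\, e^{-(w/2)(1 + t^2/\nu)}, \qquad -\infty < t < \infty,\; 0 < w < \infty.$$

Integrating out w, we find the density of t as

$$f(t) = \int_0^\infty f(t, w)\, dw = \frac{1}{\sqrt{2\pi\nu}\; 2^{\nu/2}\,\Gamma(\nu/2)} \int_0^\infty w^{(\nu+1)/2 - 1}\, e^{-(w/2)(1 + t^2/\nu)}\, dw.$$

Let $y = \dfrac{w}{2}\left(1 + \dfrac{t^2}{\nu}\right)$. Then $w = 2y\left(1 + \dfrac{t^2}{\nu}\right)^{-1}$ and $dw = 2\left(1 + \dfrac{t^2}{\nu}\right)^{-1} dy$.

Substitution gives

$$f(t) = \frac{2^{(\nu+1)/2}}{\sqrt{2\pi\nu}\; 2^{\nu/2}\,\Gamma(\nu/2)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2} \int_0^\infty y^{(\nu+1)/2 - 1} e^{-y}\, dy$$

$$= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\;\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}$$

$$= \frac{1}{\sqrt{\nu}\; B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}, \qquad \text{for } -\infty < t < \infty,$$

which is the so-called Student's t-distribution with $\nu$ degrees of freedom.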
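The density just obtained is easy to evaluate numerically. The sketch below is an illustration only: the function names `t_pdf`, `normal_pdf` and `simpson` are hypothetical helpers, and Simpson's rule over a wide finite range is assumed to be an adequate stand-in for the integral over the whole real line.

```python
import math

def t_pdf(t, nu):
    """Student's t density: Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)) * (1 + t^2/nu)^(-(nu+1)/2)."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + t * t / nu) ** (-(nu + 1) / 2)

def normal_pdf(z):
    """Standard normal density, for comparison."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Total area under the density with 5 degrees of freedom (should be close to 1).
area = simpson(lambda t: t_pdf(t, 5), -60.0, 60.0)
print(round(area, 4))

# The height at t = 0 is below the normal height for small nu and approaches it as nu grows.
print(t_pdf(0.0, 3), t_pdf(0.0, 200), normal_pdf(0.0))
```

For $\nu = 1$ the density reduces to the Cauchy density, so `t_pdf(0, 1)` equals $1/\pi$, which gives a simple spot check of the constant.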


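18.2.1. Properties of Student's t-distribution. The Student's t-distribution has the following properties.

(i) The t-distribution is continuous and symmetric about the value t=0, ranging from $-\infty$ to $\infty$.

(ii) The mean of the t-distribution is $\mu = 0$, when $\nu \ge 2$. The mean is undefined for $\nu = 1$. The variance of the t-distribution is $\sigma^2 = \dfrac{\nu}{\nu - 2}$ for $\nu > 2$, and the variance for $\nu \le 2$ does not exist. The variance is greater than 1 and approaches 1 as the degrees of freedom increase.

By definition, $t = \dfrac{Z}{\sqrt{U/\nu}}$, where Z is N(0, 1), U is a $\chi^2_{(\nu)}$ random variable and Z and U are independent.

Now

$$\mu = E(t) = E\!\left[\frac{Z}{\sqrt{U/\nu}}\right] = E\!\left[Z \cdot \sqrt{\nu}\, U^{-1/2}\right] = E(Z)\cdot E\!\left[\sqrt{\nu}\, U^{-1/2}\right] = 0 \qquad (\because E(Z) = 0),$$

and

$$\sigma^2 = \mathrm{Var}(t) = E(t^2) = E\!\left(\frac{Z^2 \nu}{U}\right) = E(Z^2)\cdot E\!\left(\frac{\nu}{U}\right).$$

Since $Z^2$ is a $\chi^2$-random variable with 1 degree of freedom, therefore $E(Z^2) = 1$. Also

$$E\!\left(\frac{\nu}{U}\right) = \int_0^\infty \frac{\nu}{u} \cdot \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, u^{\nu/2 - 1} e^{-u/2}\, du.$$

This exists if $\nu > 2$ and thus is given by

$$E\!\left(\frac{\nu}{U}\right) = \frac{\nu\, \Gamma[(\nu-2)/2]\; 2^{(\nu-2)/2}}{2^{\nu/2}\,\Gamma(\nu/2)} = \frac{\nu\, \Gamma[(\nu-2)/2]}{2\,[(\nu-2)/2]\,\Gamma[(\nu-2)/2]} = \frac{\nu}{\nu - 2}.$$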

Hence the variance of the t-distribution is $\dfrac{\nu}{\nu - 2}$.

(iii) The t-distribution is unimodal with a bell shape. The density of the distribution reaches its maximum at t=0 and thus the mode of the t-distribution is t=0. The median is also equal to zero.

(iv) The t-distribution, for small values of $\nu$, is flatter than the standard normal distribution, which means that the t-distribution is more spread out in the tails than is the standard normal distribution.

(v) The shape of the t-distribution changes as the number of degrees of freedom or the sample size changes. Thus there is a different t-distribution for each number of degrees of freedom or sample size.

(vi) The t-distribution approaches the standard normal distribution N(0, 1) as the number of degrees of freedom $\nu$, or the sample size, becomes larger.

(vii) The important property of the t-distribution is that it is independent of the unknown population standard deviation $\sigma$. It is therefore applied to test hypotheses about the mean of a population irrespective of what the standard deviation may be.

18.2.2. The t-tables. The areas under the t-distribution have been tabulated for various values of t and $\nu$ for convenience. Table 18.1 on page 244 contains values of $t_\alpha$ for selected values of $\alpha$, where

$$\alpha = P(t > t_\alpha) = \int_{t_\alpha}^{\infty} \frac{1}{\sqrt{\nu}\; B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2} dt.$$

In other words, these are the values of $t_{\alpha,(\nu)}$, which denote values for which the area to the right under the t-distribution with $\nu$ degrees of freedom is equal to $\alpha$. Owing to symmetry of the t-distribution about the mean of zero, it is important to note that $t_{1-\alpha} = -t_\alpha$; that is, the t-value leaving an area of $1-\alpha$ to the right, and therefore an area equal to $\alpha$ to the left, is equal to the negative of the t-value that leaves an area of $\alpha$ in the right tail of the t-distribution. Another point to note is that the entries in the last row of the t-table with $\nu = \infty$ correspond to those for a standard normal variable.
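Tabulated values of $t_{\alpha,(\nu)}$ can be reproduced approximately by integrating the density and solving for the point that leaves area $\alpha$ in the right tail. The sketch below is only one way of doing this, assuming Simpson's rule and bisection are accurate enough for table-level precision; the names `upper_tail` and `t_critical` are hypothetical.

```python
import math

def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + t * t / nu) ** (-(nu + 1) / 2)

def upper_tail(t, nu, n=2000):
    """P(T > t) for t >= 0: by symmetry, 0.5 minus the Simpson's-rule area on [0, t]."""
    if t == 0:
        return 0.5
    h = t / n
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 - total * h / 3

def t_critical(alpha, nu, hi=100.0):
    """Value t with P(T > t) = alpha, for 0 < alpha < 0.5, found by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail(mid, nu) > alpha:
            lo = mid      # tail area still too large: move right
        else:
            hi = mid      # tail area too small: move left
    return (lo + hi) / 2

print(round(t_critical(0.025, 12), 3))   # compare with the tabulated 2.179
print(round(t_critical(0.10, 23), 2))    # compare with the tabulated 1.32
```

For $\alpha > 0.50$ the symmetry identity $t_{\alpha} = -t_{1-\alpha}$ can be applied to the value returned for $1 - \alpha$.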


Table 18.1 Student's t-Distribution

The entries in this table are values of $t_{\alpha,(\nu)}$ for which the area to their right under the t-distribution with $\nu$ degrees of freedom is equal to $\alpha$. It is necessary to make use of the identity $t_\alpha = -t_{1-\alpha}$ for $\alpha > 0.50$.

(Tabulated values omitted.)

"Table 18.1 is taken from Table III of Fisher and Yates: Statistical Tables for Biological, Agricultural & Medical Research, published by Oliver & Boyd, Ltd., Edinburgh, & reproduced by permission of authors & publishers".


18.2.3. Distribution of Difference of Sample Means: Small Samples and $\sigma_1 = \sigma_2$. Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ and $X_{21}, X_{22}, \ldots, X_{2n_2}$ be two small independent random samples from two populations $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ with the same but unknown variance $\sigma^2$. Let $\bar{X}_1$ and $\bar{X}_2$ be the sample means and

$$s_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2 \quad \text{and} \quad s_2^2 = \frac{1}{n_2 - 1}\sum_{j=1}^{n_2} (X_{2j} - \bar{X}_2)^2.$$

We know that

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

is a standard normal variable.

As the common variance $\sigma^2$ is unknown, its unbiased pooled estimate, denoted by $s_p^2$, obtained from both small samples, is given by

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2j} - \bar{X}_2)^2}{n_1 + n_2 - 2}.$$

We also note that

$$U = \frac{(n_1 + n_2 - 2)\, s_p^2}{\sigma^2}$$

is a chi-square random variable with $\nu = n_1 + n_2 - 2$ degrees of freedom and is distributed independently of Z. We therefore construct the statistic t as the ratio of the standard normal variable Z to the square root of the chi-square random variable U divided by its d.f. $= n_1 + n_2 - 2$. Thus

$$t = \frac{N(0, 1)}{\sqrt{\text{chi-square}/\text{d.f.}}} = \frac{Z}{\sqrt{U/(n_1 + n_2 - 2)}} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

which conforms to the Student's t-distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom.


18.2.4. Assumptions in Using t-distribution. To use the t-distribution, we make the following assumptions:

(i) The sample of n observations $X_1, X_2, \ldots, X_n$ is selected randomly.

(ii) The population from which the small sample is drawn is normal. This is essential for $\bar{X}$ and s, the two components of statistic t, to be independent. It has, however, been shown that slight departures from normality do not seriously affect the tests.

(iii) In case of two small samples, both the samples are selected randomly, both the populations are normal and both the populations have equal variances.


18.3 CONFIDENCE INTERVAL ESTIMATE OF MEAN FROM A SMALL SAMPLE

The procedure for constructing the confidence interval for mean from small samples is the same as for mean from large samples except that we use the Student's t-distribution instead of the standard normal distribution. Thus if $-t_{\alpha/2,(\nu)}$ and $t_{\alpha/2,(\nu)}$ denote the values of t for which an area equal to $\alpha/2$ lies in each tail of the Student's t-distribution with $\nu$ degrees of freedom, then the probability of t lying between these two values is given by the relation

$$P\left[-t_{\alpha/2,(\nu)} < t < t_{\alpha/2,(\nu)}\right] = 1 - \alpha.$$

That is, we have the following probability statement

$$P\left[-t_{\alpha/2,(\nu)} < \frac{\bar{X} - \mu}{s/\sqrt{n}} < t_{\alpha/2,(\nu)}\right] = 1 - \alpha.$$

Multiplying each term inside the bracket by $s/\sqrt{n}$, subtracting $\bar{X}$ and then multiplying by -1 (inequality signs reversed), we get

$$P\left[\bar{X} - t_{\alpha/2,(\nu)}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,(\nu)}\frac{s}{\sqrt{n}}\right] = 1 - \alpha.$$

Hence the $(1-\alpha)100$ per cent confidence interval for $\mu$ (when population $\sigma^2$ is unknown) for a particular random sample of size n < 30, is given by

$$\bar{x} \pm t_{\alpha/2,(\nu)}\frac{s}{\sqrt{n}}.$$

Similarly, the $(1-\alpha)100$ per cent confidence interval for the difference of two means $\mu_1 - \mu_2$, when the population variances are unknown but equal, is given by

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,(\nu)}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$

where $\bar{X}_1$ and $\bar{X}_2$ are the means of two small random samples of sizes $n_1$ and $n_2$ from normal populations, $s_p^2$ is the pooled estimate of the population common variance, and $\nu = n_1 + n_2 - 2$ degrees of freedom.

The probability statement has to be interpreted in the same way as done in the case of confidence intervals obtained by using the normal distribution in Chapter 15. A one-sided upper (or lower) confidence limit may be found by replacing the lower (or upper) confidence limit with $-\infty$ ($+\infty$) and using $\alpha$ instead of $\alpha/2$. For example, the $(1-\alpha)100\%$ upper confidence interval for $\mu$ would be

$$\mu \le \bar{X} + t_{\alpha,(\nu)}\frac{s}{\sqrt{n}}.$$
Example 18.1. The masses, in grams, of thirteen ball bearings taken at random from a batch are

21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0, 22.5, 26.9, 26.4, 25.8, 23.2, 21.9.

Calculate a 95% confidence interval for the mean mass of the population, supposed normal, from which these masses were drawn.

The 95% confidence interval for the mean mass of the population $\mu$ is given by

$$\bar{x} \pm t_{\alpha/2,(\nu)}\frac{s}{\sqrt{n}},$$

where

$$\bar{x} = \frac{\sum x}{n} = \frac{314.7}{13} = 24.21,$$

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{3.12} = 1.77,$$

$\nu = n - 1 = 12$, and $t_{0.025,(12)} = 2.179$ (from the t-table).

Substituting these values, we get

$$24.21 \pm 2.179\left(\frac{1.77}{\sqrt{13}}\right)$$

or 24.21 ± 2.179 (0.49) or 24.21 ± 1.07 or 23.14 to 25.28.

Hence the 95% confidence interval for $\mu$ calculated from the given sample is (23.1, 25.3) grams.
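The arithmetic of Example 18.1 can be retraced with a few lines of code. This is a minimal sketch: the helper name `t_confidence_interval` is hypothetical, and the critical value 2.179 is taken from the t-table as in the example rather than computed.

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """Return (lower, upper) = mean -/+ t_crit * s / sqrt(n), with s the (n-1)-divisor standard deviation."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)          # divides by n - 1
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width

masses = [21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0,
          22.5, 26.9, 26.4, 25.8, 23.2, 21.9]

# t_{0.025,(12)} = 2.179 is read from the t-table.
lower, upper = t_confidence_interval(masses, 2.179)
print(round(lower, 1), round(upper, 1))
```

Working with unrounded intermediate values gives limits that agree with the hand calculation to one decimal place; small differences in the second decimal come from rounding 0.49 and 1.07 in the worked steps.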
Example 18.2. Given that

$\bar{x}_1 = 75$, $n_1 = 9$, $\sum (x_{1i} - \bar{x}_1)^2 = 1482$;
$\bar{x}_2 = 60$, $n_2 = 16$, $\sum (x_{2j} - \bar{x}_2)^2 = 1830$;

and assuming that the two samples were randomly selected from two normal populations in which $\sigma_1^2 = \sigma_2^2$ (but unknown), calculate an 80% confidence interval for the difference between the two population means.
(P.U., M.A. Stat., 1966)

The 80% confidence interval for the difference between the two population means $\mu_1 - \mu_2$ is given by

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,(\nu)}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$

where $\nu$ (d.f.) $= n_1 + n_2 - 2$.

Here the difference between sample means is $\bar{x}_1 - \bar{x}_2 = 75 - 60 = 15$.

The pooled estimate of the common variance $\sigma^2$ is

$$s_p^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2\right] = \frac{1482 + 1830}{9 + 16 - 2} = 144, \text{ so that } s_p = \sqrt{144} = 12;$$

$\nu$ = degrees of freedom = 9 + 16 - 2 = 23;
$t_{0.10,(23)} = 1.32$ (from the t-table).

Substituting these values, we obtain

$$15 \pm (1.32)(12)\sqrt{\frac{1}{9} + \frac{1}{16}}$$

or 15 ± (1.32) (5) or 15 ± 6.6 or 8.4 to 21.6.

Hence the 80% confidence interval for $\mu_1 - \mu_2$, calculated from the given information, is (8.4, 21.6).
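The same interval can be computed from the summary figures alone. The sketch below assumes the inputs are the sample means, the sums of squared deviations and the sample sizes, as given in Example 18.2; the function names are hypothetical, and the critical value 1.32 is taken from the t-table.

```python
import math

def pooled_variance(ss1, ss2, n1, n2):
    """Pooled estimate of the common variance from sums of squared deviations."""
    return (ss1 + ss2) / (n1 + n2 - 2)

def difference_interval(mean1, mean2, ss1, ss2, n1, n2, t_crit):
    """Confidence interval for mu1 - mu2 when the two variances are assumed equal."""
    sp = math.sqrt(pooled_variance(ss1, ss2, n1, n2))
    half_width = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff - half_width, diff + half_width

# Summary values from Example 18.2; t_{0.10,(23)} = 1.32 is read from the t-table.
low, high = difference_interval(75, 60, 1482, 1830, 9, 16, 1.32)
print(round(low, 1), round(high, 1))
```

Passing sums of squares rather than raw data keeps the function usable for textbook problems that report only summary statistics.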


18.4 SMALL SAMPLE TESTS OF MEANS

Statistical inference may have to be drawn from small samples when collecting data for large samples is costly and time consuming. So, when small samples are drawn from normally distributed populations with unknown standard deviations, the hypothesis-testing techniques are based on the Student's t-distribution. A number of small sample tests of means are discussed in the subsections that follow.


18.4.1. Testing Hypothesis about Mean of a Normal Population when $\sigma$ is unknown and n < 30. Let $X_1, X_2, \ldots, X_n$ be the observations in a small random sample of size n, taken from a normally distributed population whose standard deviation $\sigma$ is unknown. Then $\sigma$ is estimated from the sample data. Let $\bar{X}$ be the sample mean and $s^2$ be the unbiased estimate of $\sigma^2$; then, if we wish to test the hypothesis that the population mean has a specified value $\mu_0$, the statistic

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

has, when the hypothesis is true, a t-distribution with $\nu = n - 1$ degrees of freedom. Hence the sample t value is the test-statistic for testing a hypothesis about the mean of a normal population with unknown standard deviation.

The procedure for testing the hypothesis $H_0 : \mu = \mu_0$ is given below:

(i) Formulate the null and alternative hypotheses about $\mu$. Three possible forms are
(a) $H_0 : \mu = \mu_0$ and $H_1 : \mu \ne \mu_0$ (two-sided alternative),
(b) $H_0 : \mu \le \mu_0$ and $H_1 : \mu > \mu_0$ (one-sided alternative),
(c) $H_0 : \mu \ge \mu_0$ and $H_1 : \mu < \mu_0$ (one-sided alternative).

(ii) Decide upon the significance level $\alpha$.

(iii) The test-statistic to use is

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}},$$

which, if $H_0$ is true, follows a Student's t-distribution with n - 1 degrees of freedom.

(iv) Determine the critical region, which for $H_0$ corresponding to different alternative hypotheses is given as follows:

When the alternative hypothesis is (a) $H_1 : \mu \ne \mu_0$, the critical region will be $t \le -t_{\alpha/2,(n-1)}$ and $t \ge t_{\alpha/2,(n-1)}$;
when it is (b) $H_1 : \mu > \mu_0$, the critical region will be $t \ge t_{\alpha,(n-1)}$;
when it is (c) $H_1 : \mu < \mu_0$, the critical region will be $t \le -t_{\alpha,(n-1)}$.

Owing to symmetry, the critical region in case of two-sided alternative is usually stated as $|t| \ge t_{\alpha/2,(n-1)}$.

(v) Compute the t-value from the given sample data.

(vi) Decide as below:
Reject $H_0$ when the computed t-value falls in the critical region, otherwise accept it.
Example 18.3. Ten individuals are chosen at random from a normal population and the heights are found to be in inches 63, 63, 66, 67, 68, 69, 70, 70, 71 and 71. In the light of these data, discuss the suggestion that mean height in the population is 66 inches.
(P.U., B.A/B.Sc. 1969)

(i) We state our null and alternative hypotheses as

$H_0 : \mu = 66$ and $H_1 : \mu \ne 66$ (two-sided).

(ii) We set the significance level at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \quad \text{where } s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right],$$

which, if $H_0$ is true, has the t-distribution with n - 1 = 9 degrees of freedom.

(iv) The critical region is $|t| \ge t_{0.025,(9)} = 2.262$.

(v) We compute the t-value from the sample data as below:

$\sum x_i = 63 + 63 + \ldots + 71 = 678$,
$\sum x_i^2 = (63)^2 + (63)^2 + \ldots + (71)^2 = 46050$,

$$\bar{x} = \frac{\sum x_i}{n} = \frac{678}{10} = 67.8 \text{ inches},$$

$$s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right] = \frac{1}{9}\left[46050 - \frac{(678)^2}{10}\right] = \frac{1}{9}\left[46050 - 45968.4\right] = 9.0667,$$

so that $s = \sqrt{9.0667} = 3.01$ inches.

Hence

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{67.8 - 66}{3.01/\sqrt{10}} = \frac{(1.8)(3.1623)}{3.01} = 1.89.$$

(vi) Conclusion. Since the computed value of t = 1.89 does not fall in the critical region, we therefore do not reject $H_0$ and may conclude that the population mean is 66 inches.
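The six-step procedure, applied to Example 18.3, can be sketched as below. Two points are assumptions rather than part of the text: the tenth height is taken as 71, the value consistent with the stated totals $\sum x_i = 678$ and $\sum x_i^2 = 46050$, and the critical value 2.262 for $t_{0.025,(9)}$ is taken from a t-table rather than computed. The function name is hypothetical.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (mean - mu0) / (s / sqrt(n)), with s the (n-1)-divisor standard deviation."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    return (mean - mu0) / (s / math.sqrt(n))

# Heights in inches; the tenth value (71) is inferred from the stated totals.
heights = [63, 63, 66, 67, 68, 69, 70, 70, 71, 71]

t_value = one_sample_t(heights, 66)
t_crit = 2.262                      # t_{0.025,(9)} from a t-table (two-sided, alpha = 0.05)
reject = abs(t_value) >= t_crit     # two-sided critical region |t| >= t_crit

print(round(t_value, 2), reject)
```

The one-sided alternatives differ only in the last comparison: `t_value >= t_crit` or `t_value <= -t_crit`, with the critical value read at $\alpha$ instead of $\alpha/2$.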


Example 18.4. In a sample survey, six estimates were made of the same mean. When the population mean became known, the following errors were computed: -35, 111, -88, 47, -12, 26. Are these errors consistent with the hypothesis that the population of errors has a zero mean? Assume that the errors are normally distributed.

(i) We formulate our null and alternative hypotheses as
$H_0 : \mu = 0$ and $H_1 : \mu \ne 0$.

(ii) We choose the level of significance at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}},$$

which, if $H_0$ is true, has the Student's t-distribution with n - 1 = 5 degrees of freedom.

(iv) The critical region is $|t| \ge t_{0.025,(5)} = 2.57$.

(v) We compute the t-value from the sample data as follows:

$\sum x_i = 49$, $\sum x_i^2 = 24319$,

$$\bar{x} = \frac{49}{6} = 8.17, \qquad s^2 = \frac{1}{5}\left[24319 - \frac{(49)^2}{6}\right] = 4783.77,$$

$s = \sqrt{4783.77} = 69.16$.

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{8.17 - 0}{69.16/\sqrt{6}} = \frac{(8.17)(2.4495)}{69.16} = 0.29.$$

(vi) Conclusion. Since the computed value of t = 0.29 does not fall in the critical region, we do not reject $H_0$ and may conclude that the data are consistent with the hypothesis that the population of errors has a zero mean.


18.4.2. Testing Hypotheses about Difference of Means of Two Normal Populations when $\sigma_1 = \sigma_2$ but unknown. Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ and $X_{21}, X_{22}, \ldots, X_{2n_2}$ be two small independent random samples from two normal populations with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$ respectively. If $\sigma_1 = \sigma_2$ $(= \sigma)$ but unknown, then the unbiased pooled or combined estimate of the common variance $\sigma^2$ (the term common variance means that each population has the same variance) is given by

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$

where $s_1^2 = \dfrac{1}{n_1 - 1}\sum (X_{1i} - \bar{X}_1)^2$ and $s_2^2 = \dfrac{1}{n_2 - 1}\sum (X_{2j} - \bar{X}_2)^2$.

It has been previously shown that the statistic

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

has a Student's t-distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom. Hence for small samples, taken from normal populations with unknown but equal standard deviations, it is used as the test-statistic for testing hypotheses about the difference between two population means.

Suppose we wish to test the hypothesis that the difference between means has a specified value $\Delta_0$, i.e., $H_0 : \mu_1 - \mu_2 = \Delta_0$. If $H_0$ is true, the test-statistic becomes

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

which conforms to the t-distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom. In case $\Delta_0 = 0$, the test-statistic reduces to

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}.$$

The procedure for testing the hypothesis $H_0 : \mu_1 - \mu_2 = \Delta_0$ in case of small independent samples and when $\sigma_1 = \sigma_2$ may be stated as below:

(i) Formulate the null and the alternative hypotheses; given $\sigma_1 = \sigma_2 = \sigma$ unknown;
$H_0 : \mu_1 - \mu_2 = \Delta_0$ against the appropriate alternative.

(ii) Decide on the significance level $\alpha$.

(iii) The test-statistic under $H_0$ is

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta_0}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

which has a t-distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom.

(iv) The critical region is
$t \le -t_{\alpha/2,(\nu)}$ and $t \ge t_{\alpha/2,(\nu)}$, when $H_1$ is $\mu_1 - \mu_2 \ne \Delta_0$;
$t \le -t_{\alpha,(\nu)}$, when $H_1$ is $\mu_1 - \mu_2 < \Delta_0$;
$t \ge t_{\alpha,(\nu)}$, when $H_1$ is $\mu_1 - \mu_2 > \Delta_0$.

(v) Compute the t-value from the given data.

(vi) Decide as below:
Reject $H_0$ if t falls in the critical region, accept $H_0$ otherwise.

It is worth remarking that when the sample size is large, i.e. n > 30, we generally suggest that the normal distribution be used to approximate the t-distribution because the t and z-values will then be quite close. Because of this procedure, the t-distribution is sometimes incorrectly referred to as applying only to small samples. However, it is emphasized that whenever $\sigma$ is unknown and the population is normal, the application of the t-distribution is always correct. (See Exercise 18.82).

Example 18.5. Given the following samples from two normally distributed populations with identical but unknown standard deviations, test $H_0 : \mu_1 - \mu_2 \le 3$ against $H_1 : \mu_1 - \mu_2 > 3$. Let $\alpha = 0.10$.

51, 42, 49, 55, 46, 63, 56, 58, 47, 39, 47

38, 49, 45, 29, 31, 35.

(I.U., M.Sc. 1987)

(i) We state our null and alternative hypotheses as

$H_0 : \mu_1 - \mu_2 \le 3$ and $H_1 : \mu_1 - \mu_2 > 3$ (one-tailed).

(ii) The significance level is set at $\alpha = 0.10$.

(iii) The test-statistic to use is

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - 3}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

which, if $H_0$ is true, has a Student's t-distribution with $n_1 + n_2 - 2$ (= 11 + 6 - 2 = 15) degrees of freedom.

(iv) The critical region consists of all t-values that are greater than or equal to $t_{0.10,(15)} = 1.341$.

(v) Computations.

Now the sample means are

$$\bar{x}_1 = \frac{\sum x_{1i}}{n_1} = \frac{553}{11} = 50.3, \qquad \bar{x}_2 = \frac{\sum x_{2j}}{n_2} = \frac{227}{6} = 37.8,$$

and

$$\sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)^2 = \sum x_{1i}^2 - \frac{(\sum x_{1i})^2}{n_1} = 28315 - \frac{(553)^2}{11} = 28315 - 27800.82 = 514.18,$$

$$\sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2 = \sum x_{2j}^2 - \frac{(\sum x_{2j})^2}{n_2} = 8897 - \frac{(227)^2}{6} = 8897 - 8588.17 = 308.83,$$

$$s_p^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2\right] = \frac{514.18 + 308.83}{11 + 6 - 2} = 54.87,$$

so that $s_p = \sqrt{54.87} = 7.41$.

Thus

$$t = \frac{(50.3 - 37.8) - 3}{7.41\sqrt{\dfrac{1}{11} + \dfrac{1}{6}}} = \frac{9.5}{3.76} = 2.53.$$

(vi) Conclusion. Since the computed value of t = 2.53 falls in the critical region, we reject $H_0$ and accept $H_1$.
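The pooled two-sample procedure can be sketched in code using the two samples of Example 18.5. The function name `pooled_t` is hypothetical, and the critical value 1.341 for $t_{0.10,(15)}$ is taken from the example rather than computed. Because the code keeps the sample means unrounded, its t-value is slightly smaller than one obtained from means rounded to one decimal place; the decision is the same either way.

```python
import math
import statistics

def pooled_t(sample1, sample2, delta0=0.0):
    """Pooled-variance t statistic for H0: mu1 - mu2 = delta0, with its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    ss1 = sum((x - m1) ** 2 for x in sample1)   # sum of squared deviations, sample 1
    ss2 = sum((x - m2) ** 2 for x in sample2)   # sum of squared deviations, sample 2
    df = n1 + n2 - 2
    sp = math.sqrt((ss1 + ss2) / df)            # pooled standard deviation
    t = ((m1 - m2) - delta0) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, df

first = [51, 42, 49, 55, 46, 63, 56, 58, 47, 39, 47]
second = [38, 49, 45, 29, 31, 35]

t_value, df = pooled_t(first, second, delta0=3)
reject = t_value >= 1.341          # one-sided critical region, t_{0.10,(15)} = 1.341
print(round(t_value, 2), df, reject)
```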
‘oa planted in one variety of 
8. om an area P guayule ( 
Example. 1 i 


521, 5.10, 6.04, 4-41, 5.22, 4.45, 4.84, 5.88, 5.9, 
5 09, 5.59, 6.06, 5.59) 6.74; 5.55. 


4.28, 7.71, 6-48, 7.71, 7.37, 7.20, 7. 


06, 6.40, 8.93, 
5.91, 5.51, 6.36. = 


x Test the hypothesis of no difference between means of populations 


of rubber percentages. Assume the populations of rubber percentages are 


Aberrant 


“approximately normal and have equal variances. (P.U., B.A/B.Sc. 1983) 


(i) We formulate our null and alternative hypotheses as 
Hy: tir He =% ie. there is no difference between means; and 
Hy: 7 Ho #9, ie. the two means are different. 


(ii) We set the significance level at a = 0.05. 
(ii) The test-statistic, if Hy is true, is 


which has a Student’s t-distribution with v = 7, + 727 2, i.e. a 


degrees of freedom. 

(iv) The critical region consists of all t-values which are greater as 
or equal to to 995 (95) = 2.06 and which are less than OF it 
—t,025,(25) = ~2.06. 

(v) Computations. Let $X_{1j}$ and $X_{2j}$ represent the offtype and aberrant measurements respectively. Then

$$\sum_{j=1}^{15} (X_{1j} - \bar{X}_1)^2 = \sum X_{1j}^2 - \frac{(\sum X_{1j})^2}{n_1} = 478.9779 - \frac{(84.25)^2}{15} = 478.9779 - 473.2042 = 5.7737;$$

$$\sum_{j=1}^{12} (X_{2j} - \bar{X}_2)^2 = \sum X_{2j}^2 - \frac{(\sum X_{2j})^2}{n_2} = 561.6402 - \frac{(80.92)^2}{12} = 561.6402 - 545.6705 = 15.9697.$$

$$s_p^2 = \frac{\sum (X_{1j} - \bar{X}_1)^2 + \sum (X_{2j} - \bar{X}_2)^2}{n_1 + n_2 - 2} = \frac{5.7737 + 15.9697}{15 + 12 - 2} = 0.8697, \text{ so that}$$

$$s_p = \sqrt{0.8697} = 0.93, \text{ and}$$

$$t = \frac{5.62 - 6.74}{0.93 \sqrt{\dfrac{1}{15} + \dfrac{1}{12}}} = \frac{-1.12}{0.36} = -3.11.$$

(vi) Conclusion. Since the computed value of $t = -3.11$ falls in the critical region, we reject the hypothesis of no difference between the two means. We conclude that there is sufficient evidence to indicate a difference in the means of rubber percentages.
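The pooled-variance computation of this example can be retraced with a short Python sketch. It assumes only the sums quoted in the computations ($\sum X_{1j} = 84.25$, $\sum X_{1j}^2 = 478.9779$, $\sum X_{2j} = 80.92$, $\sum X_{2j}^2 = 561.6402$) and the tabulated critical value 2.06; the function name `pooled_t` is a hypothetical helper, not part of the text.

```python
import math

def pooled_t(n1, sum1, sumsq1, n2, sum2, sumsq2):
    """Pooled-variance t statistic for H0: mu1 - mu2 = 0, from raw sums."""
    ss1 = sumsq1 - sum1 ** 2 / n1          # sum of squared deviations, sample 1
    ss2 = sumsq2 - sum2 ** 2 / n2          # sum of squared deviations, sample 2
    df = n1 + n2 - 2                       # degrees of freedom
    sp = math.sqrt((ss1 + ss2) / df)       # pooled standard deviation
    se = sp * math.sqrt(1 / n1 + 1 / n2)   # standard error of the difference
    t = (sum1 / n1 - sum2 / n2) / se
    return t, df, sp

# Sums taken from the worked example (offtype = sample 1, aberrant = sample 2).
t, df, sp = pooled_t(15, 84.25, 478.9779, 12, 80.92, 561.6402)

T_CRIT = 2.06  # t_{0.025,(25)} as quoted in step (iv)
reject = abs(t) >= T_CRIT
print(f"t = {t:.2f}, df = {df}, s_p = {sp:.2f}, reject H0: {reject}")
```

Without intermediate rounding the statistic comes out near -3.12 rather than the -3.11 obtained above from rounded means and $s_p$; the decision is unchanged.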


18.4.3. Testing Hypotheses about Difference of Means of Two Normal Populations when $\sigma_1 \ne \sigma_2$ and unknown. Suppose that we are given two small random samples $X_{11}, X_{12}, \ldots, X_{1n_1}$ and $X_{21}, X_{22}, \ldots, X_{2n_2}$ from two normally distributed populations with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$ respectively. If $\sigma_1 \ne \sigma_2$ and unknown, we use their sample estimates $s_1$ and $s_2$ to compute the standard error of the difference between means and get

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},$$

as there is no point in combining $s_1$ and $s_2$ to obtain an estimate of a non-existent common population variance. Consequently, for testing the null hypothesis $H_0: \mu_1 = \mu_2$ the test-statistic, if $H_0$ is true, is given by the relation

$$t' = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$

which has approximately a t-distribution with $\nu$ degrees of freedom, where

$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$

When the value of $\nu$ does not happen to be an integer, it is always rounded down for degrees of freedom. This general problem is often called the Behrens-Fisher problem. The rest of the procedure is the same.

Example 18.7. Given two random samples of size $n_1 = 7$ and $n_2 = 6$ from two independent normal populations, with $\bar{X}_1 = 10.91$, $\bar{X}_2 = 4.60$, $s_1 = 6.34$ and $s_2 = 3.09$, test the hypothesis at the 0.05 level of significance that $\mu_1 = \mu_2$ against the alternative that $\mu_1 \ne \mu_2$. Assume that the population variances are unequal.

(i) We state our null and alternative hypotheses as
$H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \ne \mu_2$.

(ii) The level of significance is set at $\alpha = 0.05$.

(iii) Since the populations have unequal variances, the test statistic, if $H_0$ is true, is

$$t' = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$

which has approximately a t-distribution with $\nu$ degrees of freedom, where

$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$

(iv) We then compute the values of $t'$ and $\nu$ from the data as follows:

$$t' = \frac{10.91 - 4.60}{\sqrt{\dfrac{(6.34)^2}{7} + \dfrac{(3.09)^2}{6}}} = \frac{6.31}{2.708} = 2.33, \text{ and}$$

$$\nu = \frac{\left[(6.34)^2/7 + (3.09)^2/6\right]^2}{\dfrac{[(6.34)^2/7]^2}{6} + \dfrac{[(3.09)^2/6]^2}{5}} = \frac{53.79}{6.00} \approx 8$$

(the d.f. is always rounded down as in this calculation).

(v) The critical region is $|t'| \ge t_{0.025,(8)} = 2.306$.

(vi) Conclusion. Since the calculated value of $t'$ falls in the critical region, we reject our null hypothesis of equal means.
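As a check on Example 18.7, the following Python sketch recomputes $t'$ and the approximate degrees of freedom from the summary statistics. It assumes $\bar{X}_2 = 4.60$, the value consistent with the printed results $2.708$ and $t' = 2.33$; the helper name `unequal_variance_t` is hypothetical.

```python
import math

def unequal_variance_t(mean1, s1, n1, mean2, s2, n2):
    """t' and its approximate degrees of freedom when variances are unequal."""
    v1 = s1 ** 2 / n1                      # s1^2 / n1
    v2 = s2 ** 2 / n2                      # s2^2 / n2
    t_prime = (mean1 - mean2) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_prime, nu

t_prime, nu = unequal_variance_t(10.91, 6.34, 7, 4.60, 3.09, 6)
df = math.floor(nu)        # the text rounds the d.f. down

T_CRIT = 2.306             # t_{0.025,(8)} as quoted in step (v)
reject = abs(t_prime) >= T_CRIT
print(f"t' = {t_prime:.2f}, nu = {nu:.2f} -> {df} d.f., reject H0: {reject}")
```

Rounding $\nu$ down makes the critical value slightly larger, so the test errs on the conservative side.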


18.4.4. Testing Hypotheses about Two Means with Paired Observations. In testing hypotheses about two means, we have used independent samples, but there are many situations in which the two samples are not independent. This happens when the observations are found in pairs, as the two observations of a pair are related to each other. Pairing occurs either naturally or by design. Natural pairing occurs whenever measurement is taken on the same unit or individual at two different times. For example, suppose 10 young recruits are given a strenuous physical training programme by the Army. Their weights are recorded before they begin and after they complete the training. The two observations obtained for each recruit, i.e. the before-and-after measurements, constitute natural pairing. Observations are also paired to eliminate effects in which there is no interest. For example, suppose we wish to test which of two types (A or B) of fertilizers is the better. The two types of fertilizers are applied to a number of plots and the results are noted. Assuming that the two types are found significantly different, we may find that part of the difference may be due to the different types of soil or different weather conditions, etc. Thus the real difference between the fertilizers can be found only when the plots are paired according to the same types of soil or same weather conditions, etc. We eliminate the undesirable sources of variation by taking the observations in pairs. This is pairing by design.

When the observations from two samples are paired either naturally or by design, we find the difference between the two observations of each pair. Treating the differences as a random sample from a normal population with mean $\mu_D = \mu_1 - \mu_2$ and unknown standard deviation $\sigma_D$, we perform a one-sample t-test on them. This is called a paired difference t-test or a paired t-test.


Testing the hypothesis $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \ne \mu_2$ is equivalent to testing $H_0: \mu_D = 0$ against $H_1: \mu_D \ne 0$.

Let $d_i = X_{1i} - X_{2i}$ denote the difference between the two observations in the $i$th pair. Then the sample mean and standard deviation of the differences are

$$\bar{d} = \frac{\sum d_i}{n} \quad \text{and} \quad s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}},$$

where $n$ represents the number of pairs.

Assuming that (i) $d_1, d_2, \ldots, d_n$ is a random sample of differences, and (ii) the differences are normally distributed, the test-statistic

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

follows a t-distribution with $\nu = n - 1$ degrees of freedom. The rest of the procedure for testing the null hypothesis $H_0: \mu_D = 0$ is the same.
Example 18.8. Ten young recruits were put through a strenuous physical training programme by the Army. Their weights were recorded before and after the training with the following results:

[Table: weight before and weight after the training for each of the 10 recruits]

Using $\alpha = 0.05$, would you say that the programme affects the average weight of recruits? Assume the distribution of weights before and after to be approximately normal. (P.U., B.A/B.Sc.)

The pairing was natural here, since two observations are made on the same recruit at two different times. The sample consists of 10 recruits with two measurements on each.

The test is carried out as below:

(i) We state our null and alternative hypotheses as
$H_0: \mu_D = 0$ and $H_1: \mu_D \ne 0$.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic under $H_0$ is

$$t = \frac{\bar{d}}{s_d / \sqrt{n}},$$

which has a t-distribution with $n - 1$ degrees of freedom.

(iv) The critical region is $|t| \ge t_{0.025,(9)} = 2.262$.

(v) Computations. Taking the differences $d_i$ (after minus before), we find $\sum d_i = 47$ and $\sum d_i^2 = 673$.

Now $\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{47}{10} = 4.7$, and

$$s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1} = \frac{1}{n - 1}\left[\sum d_i^2 - \frac{(\sum d_i)^2}{n}\right] = \frac{1}{9}\left[673 - \frac{(47)^2}{10}\right] = \frac{673 - 220.9}{9} = 50.23, \text{ so that}$$

$$s_d = \sqrt{50.23} = 7.09.$$

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{4.7}{7.09 / \sqrt{10}} = \frac{(4.7)(3.16)}{7.09} = 2.09.$$

(vi) Conclusion. Since the calculated value of $t = 2.09$ does not fall in the critical region, we accept $H_0$ and may conclude that the data do not provide sufficient evidence to indicate that the programme affects average weight.
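The paired t-test needs only $n$, $\sum d_i$ and $\sum d_i^2$, which the following Python sketch makes explicit. It uses the sums of Example 18.8 ($\sum d_i = 47$, $\sum d_i^2 = 673$) and, for comparison, those of Example 18.9 ($\sum d_i = -28$, $\sum d_i^2 = 94$); the function name `paired_t` is a hypothetical helper.

```python
import math

def paired_t(n, sum_d, sum_d2):
    """Mean difference, s_d and t for a paired t-test of H0: mu_D = 0."""
    d_bar = sum_d / n
    s_d = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))
    t = d_bar / (s_d / math.sqrt(n))
    return d_bar, s_d, t

T_CRIT = 2.262  # t_{0.025,(9)}, used in both examples

# Example 18.8: recruits' weights, after minus before.
d_bar8, s_d8, t8 = paired_t(10, 47, 673)
print(f"18.8: d_bar = {d_bar8:.1f}, s_d = {s_d8:.2f}, t = {t8:.2f}, "
      f"reject H0: {abs(t8) >= T_CRIT}")

# Example 18.9: wheat yields, Variety I minus Variety II.
d_bar9, s_d9, t9 = paired_t(10, -28, 94)
print(f"18.9: d_bar = {d_bar9:.1f}, s_d = {s_d9:.2f}, t = {t9:.2f}, "
      f"reject H0: {abs(t9) >= T_CRIT}")
```

The unrounded statistics (about 2.10 and -6.73) differ in the last digit from the hand computations, which round $s_d$ first; the conclusions are the same.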


Example 18.9. The following data give paired yields of two varieties of wheat. Each pair was planted in a different locality. Test the hypothesis that the mean yields are equal. (P.U., B.A/B.Sc.)

[Table: yields of Variety I and Variety II for the 10 pairs]

The pairing was by design here, as the yields are affected by many extraneous factors such as fertility of land, fertilizer applied, weather conditions and so forth. The test is carried out as below:

(i) We state our null and alternative hypotheses as
$H_0: \mu_D = 0$ (or $\mu_1 = \mu_2$), i.e. the mean yields are equal and
$H_1: \mu_D \ne 0$, i.e. the mean yields differ.

(ii) We select the level of significance at $\alpha = 0.05$.

(iii) The test statistic to be used is

$$t = \frac{\bar{d}}{s_d / \sqrt{n}},$$

where $\bar{d} = \bar{X}_1 - \bar{X}_2$ and $s_d^2$ is the variance of the differences. If the populations are normal, this statistic, when $H_0$ is true, has a Student's t-distribution with $(n - 1)$ d.f.

(iv) The critical region is $|t| \ge t_{0.025,(9)} = 2.262$.

(v) Computations. Let $X_{1i}$ and $X_{2i}$ represent the yields of Variety I and Variety II respectively. Then we have

$d_i = X_{1i} - X_{2i} = -2, -2, -2, -2, -3, -6, -2, -2, -4, -3.$

$\sum d_i = -28$, $\sum d_i^2 = 94$.

Now $\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{-28}{10} = -2.8$, and

$$s_d^2 = \frac{1}{n - 1}\left[\sum d_i^2 - \frac{(\sum d_i)^2}{n}\right] = \frac{1}{9}\left[94 - \frac{(-28)^2}{10}\right] = 1.733, \text{ so that } s_d = 1.32.$$

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{-2.8}{1.32 / \sqrt{10}} = \frac{(-2.8)(3.1623)}{1.32} = -6.71.$$

(vi) Conclusion. Since the calculated value of $t = -6.71$ falls in the critical region, we reject $H_0$. The data present sufficient evidence to conclude that the mean yields are not equal.

Example 18.10. An experiment was performed with seven hop plants. One half of each plant was pollinated and the other half was not pollinated. The yield of the seed of each hop plant is tabulated as follows:

Pollinated |
Non-pollinated | 0.21 | 0.12 | 0.82 | 0.29 | | 0.20 |

(a) Determine at the 5 per cent level whether the pollinated half of the plant gives a higher yield in seed than the non-pollinated half. State the assumptions and hypotheses to be tested and carry through the computations to make a decision.

(b) How can the experimenter make a Type-I error? What are the consequences of his doing so?

(c) How can the experimenter make a Type-II error? What are the consequences of his doing so?

(d) Give 90 per cent confidence limits for the difference in mean yields. (P.U., M.Sc. 1971)

(a) (i) We wish to decide between the hypotheses

$H_0$: The pollinated half does not give a higher mean yield than the non-pollinated half. In other words, $H_0: \mu_P \le \mu_N$, where $\mu_P$ denotes the mean yield of pollinated and $\mu_N$ that of the non-pollinated; and

$H_1: \mu_P > \mu_N$, that is, the pollinated half does give a higher yield than the non-pollinated. (This is the claim) (one-tailed test)

(ii) The significance level is set at $\alpha = 0.05$.

(iii) As the observations are paired, the test statistic to be used is

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}.$$

Assuming that (1) the differences in yields are a random sample from the population of differences and (2) the population of

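differences is normally distributed, this statistic, if $H_0$ is true, has a Student's t-distribution with $(n - 1)$ d.f.

(iv) We then compute the value of $t$ as below:

Now $\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{3.44}{7} = 0.491$, and

$$s_d^2 = \frac{1}{n - 1}\left[\sum d_i^2 - \frac{(\sum d_i)^2}{n}\right] = \frac{1}{6}\left[1.9008 - \frac{(3.44)^2}{7}\right] = \frac{0.2093}{6} = 0.0349, \text{ so that}$$

$$s_d = \sqrt{0.0349} = 0.1868.$$

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{(0.491)(2.646)}{0.1868} = 6.96.$$

(v) The critical region is $t \ge t_{0.05,(6)} = 1.943$.

(vi) Conclusion. Since the calculated value of $t = 6.96$ falls in the rejection region, we reject our null hypothesis and conclude that there is evidence that pollination gives a higher mean yield of seed.

(b) The experimenter can make a Type-I error by rejecting a true null hypothesis. In this case, the Type-I error is made by rejecting the hypothesis when the mean yield of seed from pollinated plants is actually not greater than the mean yield of seed from non-pollinated plants. In so doing, the consequences would be that he will use the pollination when it actually does not increase the yield.

(c) The experimenter can make a Type-II error by accepting a false null hypothesis. In this case, the Type-II error is made by accepting the null hypothesis when pollination actually does increase the yield, and the consequence of committing this error would be a loss of potential increased yield.

(d) The 90% confidence limits for the difference in means $\mu_1 - \mu_2$ in case of paired observations are given by

$$\bar{d} \pm t_{\alpha/2,(n-1)} \cdot \frac{s_d}{\sqrt{n}}.$$

Substituting the values, we get

$$0.491 \pm 1.943 \cdot \frac{0.1868}{\sqrt{7}}$$

or $0.491 \pm 0.137$ or 0.354 to 0.628.

Hence the 90% confidence limits for the difference in mean yields, $\mu_1 - \mu_2$, are (0.35, 0.63).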
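The confidence limits in part (d) follow directly from $\bar{d}$, $s_d$, $n$ and the tabulated value $t_{0.05,(6)} = 1.943$. The Python sketch below retraces that arithmetic under those quoted values; `paired_confidence_limits` is a hypothetical helper name.

```python
import math

def paired_confidence_limits(d_bar, s_d, n, t_crit):
    """Limits d_bar -/+ t_crit * s_d / sqrt(n) for the mean difference."""
    half_width = t_crit * s_d / math.sqrt(n)
    return d_bar - half_width, d_bar + half_width, half_width

# Values quoted in Example 18.10.
D_BAR, S_D, N = 0.491, 0.1868, 7
T_005_6 = 1.943            # t_{0.05,(6)}: 90% two-sided limits, 6 d.f.

lower, upper, half_width = paired_confidence_limits(D_BAR, S_D, N, T_005_6)
t_stat = D_BAR / (S_D / math.sqrt(N))   # statistic of part (a) from the same inputs

print(f"half-width = {half_width:.3f}, limits = ({lower:.3f}, {upper:.3f})")
print(f"t = {t_stat:.2f}")
```

The same critical value serves both the one-tailed 5 per cent test in part (a) and the two-sided 90 per cent limits, since each leaves 5 per cent in one tail.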


EXERCISES

18.1 (a) Why is the z-test usually inappropriate as a test-statistic when the sample size is small?
(b) Define "Student's t". What are its assumptions? Explain briefly its use and importance in statistics. (P.U., B.A/B.Sc. 1985)

18.2 Derive the distribution of Student's t and discuss its chief properties. (P.U., M.Sc. 1970)

18.3 (a) Is $E(t) = 0$ for all values of $\nu$?
(b) With $t$ defined by $t = \dfrac{Z}{\sqrt{\chi^2/\nu}}$, show that $\mathrm{Var}(t) = \dfrac{\nu}{\nu - 2}$ for $\nu > 2$.
(c) Prove that the t-distribution approaches the standard normal distribution as the number of degrees of freedom $\nu$ becomes infinite. (P.U., M.Sc. 1970)

18.4 A random sample of size $n$ is drawn from a normal population with mean 5 and variance $\sigma^2$. Answer the following:
(i) If $n = 25$, $\bar{x} = 8$ and $s = 2$, what is $t$?
(ii) If $n = 9$, $\bar{x} = 2$ and $t = -2$, what is $s$?
(iii) If $n = 25$, $s = 10$ and $t = 2$, what is $\bar{x}$?
(iv) If $s = 15$, $\bar{x} = 14$ and $t = 3$, what is $n$? (P.U., B.A/B.Sc. 1988)



18.14 A sample of 12 jars of peanut butter was taken from a lot, each jar being labelled "8 ounces net weight." The individual weights in ounces are: 8.2, 8.0, 7.6, 7.6, 7.7, 7.5, 7.3, 7.4, 7.5, 8.0, [...]. Test whether these values are consistent with a population mean of 8. Assume that the weights are normally distributed.

18.15 (a) The nine items of a sample had the following values:
45, 47, 50, 52, 48, 47, 49, 58, 51
Does the mean of the nine items differ significantly from the assumed normal population mean of 47.5?
(b) Ten cartons are taken at random from an automatic filling machine. The mean net weight of the 10 cartons is 15.90 oz and the sum of squared deviations is 0.276. Does the sample mean differ significantly from the intended weight of 16 oz?

18.16 A random sample of 16 values from a normal population [...] inches [...] this mean equal to 135 (inches) [...]

[...] is in proper [...] significance of 0.05 [...]

[...] 1350, 1610, 1590 [...] lifetimes are 1472, 1486, 1401, [...] hours. Does this evidence support the manufacturer's claim [...]? Assume [...] bulbs are normally distributed.

18.19 (a) Describe the procedure for testing hypotheses about the equality of means of two normal populations for small samples.
(b) Two random samples taken independently from normal populations with an identical variance yield the following results:
[...]


Test the hypothesis that the true difference between the population means is 10, that is, that $\mu_2 - \mu_1 = 10$ against the alternative that $\mu_2 - \mu_1 > 10$ at the 5 per cent level of significance. (P.U., B.A/B.Sc. 1989)
18.20 The weights in grams of 10 male and 10 female juvenile ring- 
necked pheasants are: 
Males: 1293, 1380, 1614, 1497, 1840, 1643, 1466, 1627, 1383, 
1711; 
Females: 1061, 1065, 1092, 1017, 1021, 1138, 1143, 1094, 1270,
1028. 
Test the hypothesis of a difference of 350 grams between 
population means in favour of males against the alternative of a 
greater difference, using a 0.05 level of significance. Assume that 
the weights are normally distributed. (P.U., B.A/B.Sc. 1987, 92) 
18.21 (a) The heights of six randomly selected sailors are in inches: 63, 
65, 68, 69, 71 and 72. Those of ten randomly selected soldiers 
are 61, 62, 65, 66, 69, 69, 70, 71, 72 and 73. Discuss in the 
light of these data that soldiers are on the average taller than 
sailors. Assume that the heights are normally distributed. 
(b) Eight pots, growing three barley plants each, were exposed to 
a high tension discharge while nine similar pots were
enclosed in an earthed wire cage. The number of tillers 
(shoots) in each pot were as follows: 
Caged: 17, 27, 18, 25, 27, 29, 27, 23, 17. 
Electrified: 16, 16, 20, 16, 21, 17, 15, 20. 
Discuss whether electrification exercises any real effect on 
tillering. 
18.22 Twelve hogs were fed on diet A, 15 on diet B. The gains in 
weights for the individual hogs (in pounds) were as shown: 
A: 25, 30, 28, 34, 24, 25, 13, 32, 24, 30, 31, 35 
B: 44, 34, 22, 8, 47, 31, 40, 30, 32, 35, 18, 21, 35, 29, 22.


What conclusions may be drawn from this experiment?

18.23 (a) What statistical hypotheses can be tested by means of the t-distribution?
(b) A group of 12 children are found to have the following intelligence quotients: 112, 109, 125, 113, 116, 131, 112, 123, 108, 118, 132 and 128. Is it reasonable to suppose that these children have come from a large population whose average IQ is 115?

(c) A second group of 10 children is tested, resulting in the following IQ's:
117, 110, 106, 109, 116, 119, 107, 106, 105 and 108.
Is this group significantly different from the first group? (I.U., B.A/B.Sc. 1987)

18.24 (a) Two separate groups of subjects were tested. The experimental group (Group E) had 10 subjects; the control group (Group C) had 9 subjects. The data are given below; the scores are assumed to be normally distributed.
Group E: 12, 13, 16, 14, 15, 12, 15, 14, 13 and 16.
Group C: 10, 18, 14, 12, 15, 16, 12, 14 and 11.
Determine whether the means of the two groups differ significantly at the 0.05 level of significance.
(b) The strength of ropes made out of cotton yarn and coir gave on measurement the following values:
Cotton yarn: 7.5, 5.4, 10.6, 9.0, 6.1, 10.2, 7.9, 9.7, 7.1, 8.5
Coir: 8.3, 6.1, 9.6, 10.4, 6.4, 10.0, 7.9, 8.9, 7.5, 9.7
Test whether there is a significant difference in the strength at 0.05 level of significance.

18.25 [...]
1st group: 2.6, 1.5, 4, 1, 3.5, 3.4, 2.5, 3, 4, [...]
2nd group: [...] 1.5, 2.5, 3, 2, 3, 2, 1.5, 2.5
Determine whether the means differ significantly, assuming the standard deviation for each group is the same.

18.26 Given two random samples of size $n_1 = [...]$ and $n_2 = 16$, from two [...] with $\bar{X}_1 = 75$, $\bar{X}_2 = 60$, $s_1 = 13.61$ [...], test the hypothesis at the 0.05 level of significance that $\mu_1 = \mu_2$ against the alternative that $\mu_1 > \mu_2$. Assume that the populations have unequal variances.

18.27 (a) Distinguish between situations requiring a two-sample t-test and a paired-sample t-test. What distributional assumptions are made in each use?
(b) The weights of 4 persons before they stopped smoking and 5 weeks after they stopped smoking are as follows:

Before | 148 [...]
After | 154 [...]

Use the t-test for paired observations to test the hypothesis at the 0.05 level of significance, that giving up smoking has no effect on a person's weight. (P.U., B.A/B.Sc. 1969, 90)


18.28 To verify whether a course in statistics improved performance, a 
similar test was given to 12 participants both before and after the 
course. The original grades recorded in alphabetical order of the 
participants were 44, 40, 61, 52, 32, 44, 70, 41, 67, 72, 53 and 72. 
After the course, the grades were in the same order 53, 38, 69, 
57, 46, 39, 73, 48, 73, 74, 60 and 78. 

(a) Was the course useful, as measured by performance on the 
test? Consider these 12 participants as a sample from a 


population. 


(b) Would the same conclusion be reached if tests were not 


considered paired? Use 5% level of significance in both cases. 
(P.U., B.A./B.Sc. 1985) 


18.29 In a certain experiment to compare two types of sheep food A and B, the following results of increase in weights were observed:

Sheep No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Food A | 49 | 53 | 51 | 52 | 47 | | 52 | 53
Food B | 52 | 55 | 52 | 53 | 50 | | 54 | 53

(a) Assuming that the two samples of sheep are independent, can we conclude that food B is better than food A?
(b) Examine the case when the same set of eight sheep were used in both the foods.

18.30 The government awarded grants to nine different experimental stations of the agricultural department to test the yield capabilities of two varieties of wheat. Five acres of each variety are planted at each station and the yields, in maunds per acre, [...]

Variety 1 | [...]
Variety 2 | [...]

Test the hypothesis, at the 0.05 level of significance, that the average yields of the two varieties of wheat are equal against the alternative hypothesis that they are unequal, assuming the distribution of yields to be approximately normal. Explain why pairing is necessary in this problem.

18.31 A taxi company is trying to decide whether the use of radial tires instead of regular belted tires improves fuel economy. Twelve cars were equipped with radial tires and driven over a prescribed test course. Without changing drivers, the same cars were then equipped with regular belted tires and driven once again over the test course. The gasoline consumption in km per liter was recorded as follows:

Radial tires: 4.2, 4.7, 6.6, 7.0, 6.7, 4.5, 5.7, 6.0, 7.4, 4.9, [...]
Belted tires: 4.1, 4.9, 6.2, 6.9, 6.8, 4.4, 5.7, 5.8, 6.9, 4.7, 6.0, 4.9

At the 0.025 level of significance, can we conclude that cars equipped with radial tires [...] equipped with belted tires? Assume the populations to be normally distributed.

18.32 [...] groups on an achievement [...] scores were obtained:
[...]
Apply t-test to determine whether there is a significant difference in achievement of the two groups. (M.Sc., P.U. 1988; I.U., 1992)


o 


19

The F-Distribution and Statistical Inference

19.1 INTRODUCTION

In the preceding chapter, we used the t-distribution to test the hypothesis about the difference between two means under the assumption that the two random samples are drawn independently from two normal populations that have equal variances. But in actual practice, the variances may or may not be equal. To check the assumption that the two normally distributed populations have equal variances, we use an important distribution, called the F-distribution, which is the sampling distribution of the ratio of two independent and unbiased estimates of the population variances. If the unbiased estimates, denoted by $s_1^2$ and $s_2^2$, have been obtained from two random samples of sizes $n_1$ and $n_2$, drawn from normal populations having the same variances, then (assuming that $s_1^2$ is larger than $s_2^2$) the ratio is given by

$$F = \frac{s_1^2}{s_2^2}.$$

This ratio was named F by G.W. Snedecor (1881-1974) in honour of the great British statistician, Sir R.A. Fisher (1890-1962), who in 1924 developed its distribution as the z-distribution, which was later transformed into the F-distribution, using the relation $F = e^{2z}$.

Dividing the two estimates by the population variance $\sigma^2$, the ratio becomes

$$F = \frac{s_1^2 / \sigma^2}{s_2^2 / \sigma^2}.$$

It has already been shown that $\dfrac{(n_1 - 1) s_1^2}{\sigma^2}$, i.e. $\dfrac{\sum (X_{1i} - \bar{X}_1)^2}{\sigma^2}$, is distributed as a $\chi^2$-variable with $\nu_1 = n_1 - 1$ degrees of freedom. Similarly, the quantity $\dfrac{(n_2 - 1) s_2^2}{\sigma^2}$ is distributed as a $\chi^2$-variable with $\nu_2 = n_2 - 1$ degrees of freedom. Hence, F is a ratio of two independent chi-square random variables, each divided by its respective degrees of freedom. The distribution of this statistic (ratio) is known as Snedecor's F-distribution or sometimes, the variance ratio distribution. The F-distribution has $\nu_1 = n_1 - 1$ degrees of freedom for the numerator and $\nu_2 = n_2 - 1$ degrees of freedom for the denominator. It is interesting to note that the F-distribution has two parameters, namely $\nu_1$ and $\nu_2$, the degrees of freedom in that order. The F-distribution is extremely important as it has broad applications in modern statistical analysis.
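19.2 THE F-DISTRIBUTION

Let $s_1^2$ and $s_2^2$ be the unbiased estimated variances of two random samples of sizes $n_1$ and $n_2$, drawn from normal populations with same variances. Then the ratio $F = \dfrac{s_1^2}{s_2^2}$ may be written as

$$F = \frac{s_1^2 / \sigma^2}{s_2^2 / \sigma^2} = \frac{U / \nu_1}{V / \nu_2},$$

where $U = \dfrac{(n_1 - 1) s_1^2}{\sigma^2}$ is a $\chi^2$-variable with $\nu_1 = n_1 - 1$ d.f. and $V = \dfrac{(n_2 - 1) s_2^2}{\sigma^2}$ is a $\chi^2$-variable with $\nu_2 = n_2 - 1$ d.f.

To find the distribution of F: Since $U$ and $V$ are independent $\chi^2$-variables with $\nu_1$ and $\nu_2$ degrees of freedom respectively, their joint distribution is

$$f(u, v) = \frac{u^{\nu_1/2 - 1} e^{-u/2}}{2^{\nu_1/2}\, \Gamma(\nu_1/2)} \cdot \frac{v^{\nu_2/2 - 1} e^{-v/2}}{2^{\nu_2/2}\, \Gamma(\nu_2/2)} = \frac{u^{\nu_1/2 - 1}\, v^{\nu_2/2 - 1}\, e^{-(u + v)/2}}{2^{(\nu_1 + \nu_2)/2}\, \Gamma(\nu_1/2)\, \Gamma(\nu_2/2)}, \quad 0 < u, v < \infty.$$

To obtain the distribution of F, we make the change of variables $F = \dfrac{u/\nu_1}{v/\nu_2}$ and $v = v$, so that $u = \dfrac{\nu_1}{\nu_2} F v$ and the Jacobian of transformation is

$$J = \begin{vmatrix} \dfrac{\partial u}{\partial F} & \dfrac{\partial u}{\partial v} \\ \dfrac{\partial v}{\partial F} & \dfrac{\partial v}{\partial v} \end{vmatrix} = \begin{vmatrix} \dfrac{\nu_1}{\nu_2} v & \dfrac{\nu_1}{\nu_2} F \\ 0 & 1 \end{vmatrix} = \frac{\nu_1}{\nu_2}\, v.$$

Substituting these values, we get

$$f(F, v) = \frac{(\nu_1/\nu_2)^{\nu_1/2}\, F^{\nu_1/2 - 1}\, v^{(\nu_1 + \nu_2)/2 - 1}\, e^{-\frac{v}{2}\left(\frac{\nu_1}{\nu_2} F + 1\right)}}{2^{(\nu_1 + \nu_2)/2}\, \Gamma(\nu_1/2)\, \Gamma(\nu_2/2)}.$$

Now

$$f(F) = \int_0^\infty f(F, v)\, dv = \frac{(\nu_1/\nu_2)^{\nu_1/2}\, F^{\nu_1/2 - 1}}{2^{(\nu_1 + \nu_2)/2}\, \Gamma(\nu_1/2)\, \Gamma(\nu_2/2)} \int_0^\infty v^{(\nu_1 + \nu_2)/2 - 1}\, e^{-\frac{v}{2}\left(\frac{\nu_1}{\nu_2} F + 1\right)}\, dv.$$

Let $y = \dfrac{v}{2}\left(\dfrac{\nu_1}{\nu_2} F + 1\right)$, so that $v = 2y \left(\dfrac{\nu_1}{\nu_2} F + 1\right)^{-1}$ and $dv = 2 \left(\dfrac{\nu_1}{\nu_2} F + 1\right)^{-1} dy$.

Then after simplification, we get

$$f(F) = \frac{\Gamma[(\nu_1 + \nu_2)/2]\, (\nu_1/\nu_2)^{\nu_1/2}}{\Gamma(\nu_1/2)\, \Gamma(\nu_2/2)} \cdot \frac{F^{\nu_1/2 - 1}}{\left(1 + \dfrac{\nu_1}{\nu_2} F\right)^{(\nu_1 + \nu_2)/2}}, \quad 0 < F < \infty,$$

as the required F-distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator. It is usually abbreviated as $F(\nu_1, \nu_2)$. It is interesting to note that the F-distribution does not depend upon the population variance $\sigma^2$ but depends upon the two parameters $\nu_1$ and $\nu_2$ only.

Fisher's z-distribution can be obtained by writing $F = e^{2z}$, $dF = 2F\, dz$ in the above distribution. Fisher's z-distribution should not be confused with Fisher's z-transformation of $r$, the linear correlation co-efficient. In practice, we generally use the F-statistic as it is easier to compute and easier to apply.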
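A numerical sanity check of the density of $F(\nu_1, \nu_2)$ can be sketched in Python with the standard library only. The code below codes the density $f(F) = \dfrac{\Gamma[(\nu_1+\nu_2)/2]\,(\nu_1/\nu_2)^{\nu_1/2}}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \cdot \dfrac{F^{\nu_1/2-1}}{(1+\nu_1 F/\nu_2)^{(\nu_1+\nu_2)/2}}$ and integrates it with Simpson's rule after the substitution $x = u/(1-u)$; the names `f_pdf` and `upper_tail`, the choice $\nu_1 = 4$, $\nu_2 = 10$, and the integration scheme are illustrative assumptions. The value 3.48 is the 5 per cent point for these degrees of freedom in Table 19.1.

```python
import math

def f_pdf(x, v1, v2):
    """Density of F(v1, v2); zero for x <= 0."""
    if x <= 0:
        return 0.0
    log_const = (math.lgamma((v1 + v2) / 2) - math.lgamma(v1 / 2)
                 - math.lgamma(v2 / 2) + (v1 / 2) * math.log(v1 / v2))
    log_kernel = ((v1 / 2 - 1) * math.log(x)
                  - ((v1 + v2) / 2) * math.log1p(v1 * x / v2))
    return math.exp(log_const + log_kernel)

def upper_tail(c, v1, v2, steps=20000):
    """P(F >= c) by Simpson's rule after mapping x = u / (1 - u), u in [0, 1)."""
    def g(u):
        if u >= 1.0:
            return 0.0                      # integrand vanishes as x -> infinity
        x = u / (1.0 - u)
        return f_pdf(x, v1, v2) / (1.0 - u) ** 2
    a = c / (1.0 + c)                       # image of c under the mapping
    h = (1.0 - a) / steps                   # steps must be even for Simpson
    total = g(a) + g(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

total_area = upper_tail(0.0, 4, 10)         # should be close to 1
tail_at_table_value = upper_tail(3.48, 4, 10)  # should be close to 0.05
print(f"total area = {total_area:.6f}, P(F >= 3.48) = {tail_at_table_value:.4f}")
```

Working with logarithms of the gamma functions avoids overflow for large degrees of freedom; that is a design convenience, not something the derivation requires.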


19.2.1. Properties of the F-Distribution. The F-distribution has the following important properties:

(i) The F-distribution always ranges from zero to infinity.

(ii) The mean and the variance of the F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom are

$$\mu = \frac{\nu_2}{\nu_2 - 2}, \quad \nu_2 > 2, \text{ and}$$

$$\sigma^2 = \frac{2 \nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)}, \quad \nu_2 > 4.$$

Now the mean of the random variable F, defined as $F = \dfrac{U/\nu_1}{V/\nu_2}$, where $U$ and $V$ are independent chi-square variables with $\nu_1$ and $\nu_2$ degrees of freedom respectively, is given by

$$E(F) = \frac{\nu_2}{\nu_1}\, E(U)\, E\!\left(\frac{1}{V}\right) = \frac{\nu_2}{\nu_1} \cdot \nu_1 \cdot E\!\left(\frac{1}{V}\right) = \nu_2\, E\!\left(\frac{1}{V}\right).$$

Now

$$E\!\left(\frac{1}{V}\right) = \frac{1}{2^{\nu_2/2}\, \Gamma(\nu_2/2)} \int_0^\infty \frac{1}{v}\, v^{\nu_2/2 - 1} e^{-v/2}\, dv.$$

Let $y = \dfrac{v}{2}$, so that $v = 2y$ and $dv = 2\, dy$. Then

$$E\!\left(\frac{1}{V}\right) = \frac{2^{\nu_2/2 - 1}}{2^{\nu_2/2}\, \Gamma(\nu_2/2)} \int_0^\infty y^{(\nu_2 - 2)/2 - 1} e^{-y}\, dy = \frac{1}{2} \cdot \frac{\Gamma[(\nu_2 - 2)/2]}{\Gamma(\nu_2/2)} = \frac{1}{2} \cdot \frac{2}{\nu_2 - 2} = \frac{1}{\nu_2 - 2}.$$

Thus $E(F) = \nu_2\, E\!\left(\dfrac{1}{V}\right) = \dfrac{\nu_2}{\nu_2 - 2}$.

Hence, we see that there is no mean for $\nu_2 \le 2$. Also the mean is independent of $\nu_1$ and is always greater than 1. The result for variance is similarly established.

(iii) The F-distribution for $\nu_1 > 2$, $\nu_2 > 2$ is unimodal and the mode of the distribution with $\nu_1\ (> 2)$ is at $F = \dfrac{\nu_2 (\nu_1 - 2)}{\nu_1 (\nu_2 + 2)}$, which is always less than 1.

(iv) The F-distribution is skewed to the right. But as the degrees of freedom $\nu_1$ and $\nu_2$ become large, the F-distribution approaches the normal distribution.

(v) If F has an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom, then $\dfrac{1}{F}$ has an F-distribution with $\nu_2$ and $\nu_1$ degrees of freedom. This implies that the critical value of F that cuts off a specified area of $\alpha$ in the lower tail of the distribution turns out to be the reciprocal of the F value that cuts off the same area in the upper tail of the distribution with the degrees of freedom $\nu_1$ and $\nu_2$ interchanged, that is, the lower and upper tail points are related by

$$F_{1-\alpha}(\nu_1, \nu_2) = \frac{1}{F_\alpha(\nu_2, \nu_1)}.$$

This property is useful for testing a value of $F < 1$, where we take the reciprocal of F and interchange the degrees of freedom $\nu_1$, $\nu_2$. When $\nu_1 = \nu_2$, the distribution of $\dfrac{1}{F}$ is the same as that of F. It is important to note that the degrees of freedom associated with the sample variance in the numerator is always stated first.

(vi) The F-distribution constitutes a wide class of distributions, depending upon $\nu_1$ and $\nu_2$. For example, a t-variable with $n$ degrees of freedom is the ratio of a standard normal variable and the square root of a chi-square variable divided by its degrees of freedom ($n$), i.e. $t = \dfrac{Z}{\sqrt{\chi^2/n}}$. The square of $Z$ is a chi-square variable with 1 degree of freedom. Thus

$$t^2 = \frac{Z^2}{\chi^2/n} = \frac{Z^2/1}{\chi^2/n},$$

which is an F-variable with 1 and $n$ degrees of freedom. That is, the square of a t-variable with $n$ degrees of freedom has an F-distribution with $\nu_1 = 1$ d.f. in the numerator and $\nu_2 = n$ d.f. in the denominator, i.e. $F(1, n) = t^2_{(n)}$. For this relation, the numerator degree of freedom must be 1. Thus $F(1, 6) = t^2_{(6)}$ or $F(1, 12) = t^2_{(12)}$, etc.

When $\nu_2$ tends to infinity, the variance ratio reduces to $\chi^2/\nu_1$, so that $\nu_1 F$ is distributed as a $\chi^2$-variable with $\nu_1$ degrees of freedom. Moreover, when $\nu_1 = 1$ and $\nu_2 = \infty$, the distribution of $\sqrt{F}$ is normal.

The F-Table of Areas. In view of its importance, the F-distribution has been tabulated [...] only for the right-[...] are computed by [...] the distribution with $\nu_1$ and $\nu_2$ [...] symbol $F_\alpha(\nu_1, \nu_2)$.

Assumptions for Application of F-Distribution. The F-distribution can be applied if the following assumptions are satisfied:

(i) The two samples are independently and randomly drawn.
(ii) The two populations from which samples are selected are normally distributed. Moderate departure from normality is, however, not considered serious.
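The moment and mode formulas and the relations between tail points lend themselves to a quick arithmetic check. The Python sketch below assumes the standard formulas for the mean, variance and mode of $F(\nu_1, \nu_2)$ and uses three printed values: $t_{0.025,(9)} = 2.262$ from Chapter 18, and the 5 per cent points 5.12 (for 1 and 9 d.f.) and 4.53 (for 4 and 6 d.f.) from Table 19.1. The function names are hypothetical helpers.

```python
def f_mean(v2):
    """Mean of F(v1, v2); defined only for v2 > 2."""
    if v2 <= 2:
        raise ValueError("mean does not exist for v2 <= 2")
    return v2 / (v2 - 2)

def f_variance(v1, v2):
    """Variance of F(v1, v2); defined only for v2 > 4."""
    if v2 <= 4:
        raise ValueError("variance does not exist for v2 <= 4")
    return 2 * v2 ** 2 * (v1 + v2 - 2) / (v1 * (v2 - 2) ** 2 * (v2 - 4))

def f_mode(v1, v2):
    """Mode of F(v1, v2) for v1 > 2."""
    if v1 <= 2:
        raise ValueError("formula applies for v1 > 2")
    return v2 * (v1 - 2) / (v1 * (v2 + 2))

mean_4_10 = f_mean(10)            # 10 / 8
var_4_10 = f_variance(4, 10)      # 2400 / 1536
mode_4_10 = f_mode(4, 10)         # 20 / 48, less than 1 as stated in (iii)

# Property (vi): F_{0.05}(1, 9) should equal the square of t_{0.025,(9)}.
t_squared = 2.262 ** 2            # compare with the table entry 5.12

# Property (v): lower 5% point of F(6, 4) = 1 / F_{0.05}(4, 6) = 1 / 4.53.
lower_point_6_4 = 1 / 4.53

print(mean_4_10, var_4_10, round(mode_4_10, 4))
print(round(t_squared, 2), round(lower_point_6_4, 4))
```

The guards on the degrees of freedom mirror the existence conditions $\nu_2 > 2$ and $\nu_2 > 4$ given in property (ii).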


THE F-DISTRIBUTION AND STATISTICAL INFERENCE 


Table 19.1 Percent Points of the F-Distribution
5 Per cent Points of F, i.e. $F_{0.05}(v_1, v_2)$

 v2\v1 |      1      2      3      4      5      6      8     12     24      ∞
     1 |     --  199.5  215.7     --  230.2  234.0  238.9  243.9  249.0  254.3
     2 |  18.51  19.00  19.16  19.25  19.30  19.33  19.37  19.41  19.45  19.50
     3 |     --   9.55   9.28   9.12   9.01   8.94   8.84   8.74   8.64     --
     4 |     --   6.94   6.59   6.39   6.26   6.16   6.04   5.91   5.77     --
     5 |     --   5.79   5.41   5.19   5.05   4.95   4.82   4.68   4.53     --
     6 |     --   5.14   4.76   4.53   4.39   4.28   4.15   4.00   3.84     --
     7 |     --   4.74   4.35   4.12   3.97   3.87   3.73   3.57   3.41   3.23
     8 |     --   4.46   4.07   3.84   3.69   3.58   3.44     --   3.12     --
     9 |   5.12   4.26   3.86   3.63   3.48   3.37   3.23     --   2.90     --
    10 |   4.96   4.10   3.71   3.48   3.33   3.22   3.07   2.91   2.74     --
    11 |     --   3.98   3.59   3.36   3.20   3.09   2.95   2.79   2.61     --
    12 |     --   3.88   3.49   3.26   3.11   3.00   2.85   2.69   2.50     --
    13 |     --   3.80   3.41   3.18   3.03   2.92   2.77   2.60   2.42     --
    14 |     --   3.74   3.34   3.11   2.96   2.85   2.70   2.53   2.35   2.13
    15 |     --   3.68   3.29   3.06   2.90   2.79   2.64   2.48   2.29     --
    16 |     --   3.63   3.24   3.01   2.85   2.74   2.59   2.42   2.24     --
    17 |     --   3.59   3.20   2.96   2.81   2.70   2.55   2.38   2.19     --
    18 |     --   3.55   3.16   2.93   2.77   2.66   2.51   2.34   2.15     --
    19 |     --   3.52   3.13   2.90   2.74   2.63   2.48   2.31   2.11     --
    20 |     --   3.49   3.10   2.87   2.71   2.60   2.45   2.28   2.08     --
    21 |     --   3.47   3.07   2.84   2.68   2.57   2.42   2.25   2.05     --
    22 |     --   3.44   3.05   2.82   2.66   2.55   2.40   2.23   2.03     --
    23 |     --   3.42   3.03   2.80   2.64   2.53   2.38   2.20   2.00     --
    24 |     --   3.40   3.01   2.78   2.62   2.51   2.36   2.18   1.98     --
    25 |     --   3.38   2.99   2.76   2.60   2.49   2.34   2.16   1.96     --
    26 |     --   3.37   2.98   2.74   2.59   2.47   2.32   2.15   1.95     --
    27 |     --   3.35   2.96   2.73   2.57   2.46   2.30   2.13   1.93     --
    28 |     --   3.34   2.95   2.71   2.56   2.44   2.29   2.12   1.91     --
    29 |     --   3.33   2.93   2.70   2.54   2.43   2.28   2.10   1.90     --
    30 |     --   3.32   2.92   2.69   2.53   2.42   2.27   2.09   1.89     --
    40 |     --   3.23   2.84   2.61   2.45   2.34   2.18   2.00   1.79     --
    60 |     --   3.15   2.76   2.52   2.37   2.25   2.10   1.92   1.70     --
   120 |     --   3.07   2.68   2.45   2.29   2.17   2.02   1.83   1.61     --
     ∞ |     --   2.99   2.60   2.37   2.21   2.10   1.94   1.75   1.52     --

(-- marks an entry that is not legible in this copy.)

Lower 5 per cent points are found by interchange of $v_1$ and $v_2$, i.e. $v_1$ must always correspond with the greater mean square.

"Table 19.1 is taken from Table V of Fisher and Yates: Statistical Tables for Biological, Agricultural, and Medical Research, published by Oliver & Boyd Ltd., Edinburgh, and reproduced by permission of the authors and publishers."


Table 19.2 Percent Points of the F-Distribution
2.5 per cent Points of F, i.e. $F_{0.025}(v_1, v_2)$

 v2\v1 |      1      2      3      4      5      6      8     12     24      ∞
     1 |  647.8  799.5  864.2  899.6  921.8  937.1  956.7  976.7  997.2   1018
     2 |  38.51  39.00  39.17  39.25  39.30  39.33  39.37  39.41  39.46  39.50
     3 |  17.44  16.04  15.44  15.10  14.88  14.73  14.54  14.34  14.12  13.90
     4 |  12.22  10.65   9.98   9.60   9.36   9.20   8.98   8.75   8.51     --
     5 |     --   8.43   7.76   7.39   7.15   6.98   6.76   6.52   6.28     --
     6 |     --   7.26   6.60   6.23   5.99   5.82   5.60   5.37   5.12   4.85
     7 |     --   6.54   5.89   5.52   5.29   5.12   4.90   4.67     --   4.14
     8 |   7.57   6.06   5.42   5.05   4.82   4.65   4.43     --   3.95   3.67
     9 |     --     --     --     --     --     --     --   3.87   3.61   3.33
    10 |     --     --     --     --     --     --   3.85   3.62   3.37   3.08
    11 |     --     --     --     --     --     --     --     --     --   2.88
    12 |     --     --     --     --     --     --     --     --   3.02   2.72
    13 |     --     --     --     --   3.77   3.60   3.39   3.15   2.89   2.60
    14 |     --     --     --     --   3.66   3.50   3.29   3.05   2.79   2.49
    15 |     --     --   4.15   3.80   3.58   3.41   3.20   2.95   2.70   2.40
    16 |     --   4.69   4.08   3.73   3.50   3.34   3.12   2.89   2.63     --
    17 |   6.04   4.62   4.01   3.66   3.44   3.28   3.06   2.82   2.56     --
    18 |     --     --     --   3.61   3.38   3.22   3.01   2.77   2.50   2.19
    19 |     --     --     --   3.56   3.33   3.17   2.96   2.72   2.45   2.13
    20 |     --     --     --     --   3.29   3.13   2.91   2.68   2.41   2.09
    21 |     --     --     --     --   3.25   3.09   2.87   2.64   2.37   2.04
    22 |     --     --     --     --   3.22   3.05   2.84   2.60   2.33   2.00
    23 |     --     --     --     --   3.18   3.02   2.81   2.57   2.30   1.97
    24 |     --     --     --     --   3.15   2.99   2.78   2.54   2.27   1.94
    25 |     --     --     --     --   3.13   2.97   2.75   2.51   2.24   1.91
    26 |     --     --     --   3.33   3.10   2.94   2.73   2.49   2.22   1.88
    27 |     --     --     --   3.31   3.08   2.92   2.71   2.47   2.19   1.85
    28 |   5.61   4.22   3.63   3.29   3.06   2.90   2.69   2.45   2.17   1.83
    29 |   5.59   4.20   3.61   3.27   3.04   2.88   2.67   2.43   2.15   1.81
    30 |   5.57   4.18   3.59   3.25   3.03   2.87   2.65   2.41   2.14   1.79
    40 |     --     --     --     --   2.90   2.74   2.53   2.29   2.01   1.64
    60 |     --   3.93   3.34   3.01   2.79   2.63   2.41   2.17   1.88   1.48
   120 |     --     --   3.23   2.89   2.67   2.52   2.30   2.05   1.76   1.31
     ∞ |     --     --   3.12     --     --     --     --     --   1.64     --

(-- marks an entry that is not legible in this copy.)

Lower 2.5 per cent points are found by interchange of $v_1$ and $v_2$, i.e. $v_1$ must always correspond with the greater mean square.

Table 19.3 Percent Points of the F-Distribution
1 Per cent Points of F, i.e. $F_{0.01}(v_1, v_2)$

 v2\v1 |      1      2      3      4      5      6      8     12     24      ∞
     1 |     --     --   5405   5625   5764   5858   5982   6108     --   6366
     2 |  98.50  99.00  99.17  99.25  99.30  99.33  99.37  99.42  99.46  99.50
     3 |  34.12  30.82  29.46  28.71  28.24  27.91  27.49  27.05  26.60  26.12
     4 |  21.20  18.00  16.69  15.98  15.52  15.21  14.80  14.37  13.93  13.46
     5 |  16.26  13.27  12.06  11.39  10.97  10.67  10.29   9.89   9.47   9.02
     6 |  13.74  10.92   9.78   9.15   8.75   8.47   8.10   7.72   7.31   6.88
     7 |  12.25   9.55   8.45   7.85   7.46   7.19   6.84   6.47   6.07   5.65
     8 |  11.26   8.65   7.59   7.01   6.63   6.37   6.03   5.67   5.28   4.86
     9 |  10.56   8.02   6.99   6.42   6.06   5.80   5.47   5.11   4.73   4.31
    10 |     --   7.56   6.55   5.99   5.64   5.39   5.06   4.71   4.33   3.91
    11 |   9.65   7.20   6.22   5.67   5.32   5.07   4.74   4.40   4.02   3.61
    12 |   9.33   6.93   5.95   5.41   5.06   4.82   4.50   4.16   3.78   3.36
    13 |   9.07   6.70   5.74   5.20   4.86   4.62   4.30   3.96   3.59   3.16
    14 |   8.86   6.51   5.56   5.03   4.69   4.46   4.14   3.80   3.43   3.00
    15 |   8.68   6.36   5.42   4.89   4.56   4.32   4.00   3.67   3.29   2.87
    16 |   8.53   6.23   5.29   4.77   4.44   4.20   3.89   3.55   3.18   2.75
    17 |   8.40   6.11   5.18   4.67   4.34   4.10   3.79   3.45   3.08   2.65
    18 |   8.28   6.01   5.09   4.58   4.25   4.01   3.71   3.37   3.03   2.57
    19 |   8.18   5.93   5.01   4.50   4.17   3.94   3.63   3.30   2.92   2.49
    20 |   8.10   5.85   4.94   4.43   4.10   3.87   3.56   3.23   2.86   2.42
    21 |   8.02   5.78   4.87   4.37   4.04   3.81   3.51   3.17   2.80   2.36
    22 |   7.94   5.72   4.82   4.31   3.99   3.76   3.45   3.12   2.75   2.31
    23 |   7.88   5.66   4.76   4.26   3.94   3.71   3.41   3.07   2.70   2.26
    24 |   7.82   5.61   4.72   4.22   3.90   3.67   3.36   3.03   2.66   2.21
    25 |   7.77   5.57   4.68   4.18   3.86   3.63   3.32   2.99   2.62   2.17
    26 |   7.72   5.53   4.64   4.14   3.82   3.59   3.29   2.96   2.58   2.13
    27 |   7.68   5.49   4.60   4.11   3.78   3.56   3.26   2.93   2.55   2.10
    28 |   7.64   5.45   4.57   4.07   3.75   3.53   3.23   2.90   2.52   2.06
    29 |   7.60   5.42   4.54   4.04   3.73   3.50   3.20   2.87   2.49   2.03
    30 |   7.56   5.39   4.51   4.02   3.70   3.47   3.17   2.84   2.47   2.01
    40 |   7.31   5.18   4.31   3.83   3.51   3.29   2.99   2.66   2.29   1.80
    60 |   7.08   4.98   4.13   3.65   3.34   3.12   2.82   2.50   2.12   1.60
   120 |   6.85   4.79   3.95   3.48   3.17   2.96   2.66   2.34   1.94   1.38
     ∞ |   6.64   4.60   3.78   3.32   3.02   2.80   2.51   2.18   1.79   1.00

(-- marks an entry that is not legible in this copy.)

Lower 1 per cent points are found by interchange of $v_1$ and $v_2$, i.e. $v_1$ must always correspond with the greater mean square.

"Table 19.3 is taken from Table V of Fisher and Yates: Statistical Tables for Biological, Agricultural, and Medical Research, published by Oliver & Boyd Ltd., Edinburgh, and reproduced by permission of the authors and publishers."


19.3 CONFIDENCE INTERVAL FOR THE VARIANCE RATIO

Let two independent random samples of size $n_1$ and $n_2$ be taken from two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, and let $s_1^2$ and $s_2^2$ be the unbiased sample estimates of $\sigma_1^2$ and $\sigma_2^2$. Then we know that

$$U = \frac{(n_1-1)s_1^2}{\sigma_1^2}$$ is a $\chi^2$-variable with $v_1 = n_1-1$ d.f., and

$$V = \frac{(n_2-1)s_2^2}{\sigma_2^2}$$ is a $\chi^2$-variable with $v_2 = n_2-1$ d.f.

Thus, the ratio

$$F = \frac{U/v_1}{V/v_2} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2}$$

has an F-distribution with $v_1$ and $v_2$ degrees of freedom.

To construct a $(1-\alpha)100$ per cent confidence interval for the variance ratio $\sigma_1^2/\sigma_2^2$, we need two critical values that cut off an area of $\alpha/2$ in the lower and the upper tail respectively of the F-distribution with $v_1$ and $v_2$ degrees of freedom. If these two values turn out to be $F_{1-\alpha/2}(v_1, v_2)$ and $F_{\alpha/2}(v_1, v_2)$, then we can make the following probability statement (see figure given below):

$$P\left[F_{1-\alpha/2}(v_1, v_2) < F < F_{\alpha/2}(v_1, v_2)\right] = 1-\alpha$$

or

$$P\left[F_{1-\alpha/2}(v_1, v_2) < \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2} < F_{\alpha/2}(v_1, v_2)\right] = 1-\alpha.$$

Multiplying each term inside the bracket by $s_2^2/s_1^2$ and then inverting each term (we reverse the direction of the inequality signs when terms are inverted), we obtain

$$P\left[\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2}(v_1, v_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{1-\alpha/2}(v_1, v_2)}\right] = 1-\alpha.$$

But $\dfrac{1}{F_{1-\alpha/2}(v_1, v_2)} = F_{\alpha/2}(v_2, v_1)$ [property (v)]. Therefore

$$P\left[\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2}(v_1, v_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\, F_{\alpha/2}(v_2, v_1)\right] = 1-\alpha.$$

Hence a $(1-\alpha)100$ per cent confidence interval for $\sigma_1^2/\sigma_2^2$ is given by

$$\left[\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2}(v_1, v_2)},\; \frac{s_1^2}{s_2^2}\, F_{\alpha/2}(v_2, v_1)\right].$$

We can also find a confidence interval for $\sigma_1/\sigma_2$ by taking the square root of the endpoints of this interval.
Example 19.1. Given two random samples of size $n_1=12$ and $n_2=10$ from two independent normal populations, with $s_1=2.3$ and $s_2=1.5$, find a 90% confidence interval for $\sigma_1^2/\sigma_2^2$ and $\sigma_1/\sigma_2$.

The 90% confidence interval for the ratio $\sigma_1^2/\sigma_2^2$ is

$$\left[\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2}(v_1, v_2)},\; \frac{s_1^2}{s_2^2}\, F_{\alpha/2}(v_2, v_1)\right].$$

Here $s_1^2 = (2.3)^2 = 5.29$, $s_2^2 = (1.5)^2 = 2.25$, $\alpha=0.10$, $v_1=12-1=11$ and $v_2=10-1=9$.

Consulting the F-table, we find that $F_{0.05}(11, 9) = 3.10$ and $F_{0.05}(9, 11) = 2.90$. Substituting these values, we get

$$\left[\frac{5.29}{2.25}\left(\frac{1}{3.10}\right),\; \frac{5.29}{2.25}(2.90)\right] \text{ or } (0.76, 6.81).$$

Hence the 90% confidence interval for $\sigma_1^2/\sigma_2^2$ is (0.76, 6.81). Taking the square root of the end points (0.76, 6.81), we get a confidence interval for $\sigma_1/\sigma_2$ as (0.87, 2.61).
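The interval of Example 19.1 can be reproduced with a few lines of Python. This is only a sketch: the function name `variance_ratio_ci` and its argument names are illustrative, and the two critical values are taken from the example rather than computed. Carrying the ratio 5.29/2.25 without rounding gives an upper limit of about 6.82 instead of 6.81; the square-root interval is unchanged to two decimals.

```python
import math


def variance_ratio_ci(s1_sq, s2_sq, f_v1_v2, f_v2_v1):
    """Confidence interval for sigma1^2 / sigma2^2.

    f_v1_v2 is the upper alpha/2 point F_{alpha/2}(v1, v2) and
    f_v2_v1 is F_{alpha/2}(v2, v1), both read from an F-table.
    """
    ratio = s1_sq / s2_sq
    return ratio / f_v1_v2, ratio * f_v2_v1


# Example 19.1: s1 = 2.3, s2 = 1.5, F_0.05(11, 9) = 3.10, F_0.05(9, 11) = 2.90
lower, upper = variance_ratio_ci(2.3 ** 2, 1.5 ** 2, 3.10, 2.90)
# Interval for sigma1 / sigma2: square roots of the end points.
sd_lower, sd_upper = math.sqrt(lower), math.sqrt(upper)
print(round(lower, 2), round(upper, 2))
print(round(sd_lower, 2), round(sd_upper, 2))
```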


19.4 TESTS BASED ON F-DISTRIBUTION

The following tests of hypotheses are based on the F-distribution:

(i) Testing a hypothesis about the equality of two variances.
(ii) Testing a hypothesis about the equality of k (k > 2) population means.
(iii) Testing a hypothesis about linearity of regression.
(iv) Testing hypotheses about various correlation co-efficients.

The discussion of the hypothesis stated at (ii), i.e. the equality of three or more population means, is postponed until the next chapter, where we discuss one of the most important techniques in statistical analysis, the analysis of variance. The hypotheses stated at (iii) and (iv) will be considered in a later chapter.

Suppose we wish to test the hypothesis that the two population variances are equal (i.e. $\sigma_1^2/\sigma_2^2 = 1$ or equivalently $\sigma_1^2 = \sigma_2^2$). Let $s_1^2$ and $s_2^2$ denote the unbiased sample estimates of $\sigma_1^2$ and $\sigma_2^2$, based on $v_1=n_1-1$ and $v_2=n_2-1$ degrees of freedom. Then

$$U = \frac{(n_1-1)s_1^2}{\sigma_1^2}$$ is distributed as a $\chi^2$-variable with $v_1$ d.f., and

$$V = \frac{(n_2-1)s_2^2}{\sigma_2^2}$$ is distributed as a $\chi^2$-variable with $v_2$ d.f.

By definition

$$F = \frac{U/v_1}{V/v_2} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}, \qquad (s_1^2 > s_2^2).$$

Assuming that our null hypothesis $H_0: \sigma_1^2/\sigma_2^2 = 1$ (that is, $H_0: \sigma_1^2 = \sigma_2^2$) is true, the test statistic becomes

$$F = \frac{s_1^2}{s_2^2},$$

which has an F-distribution with $v_1$ and $v_2$ degrees of freedom. The computed value of F, if $H_0$ is true, will be relatively close to 1. If it turns out to be considerably larger than 1 (or considerably smaller than 1 if the larger $s^2$ is not placed in the numerator), it will suggest that $\sigma_1^2/\sigma_2^2 \neq 1$, that is, $\sigma_1^2 \neq \sigma_2^2$.

The procedure for testing a hypothesis that the population variances $\sigma_1^2$ and $\sigma_2^2$ are equal, consists of the following steps:

(i) Formulate the null hypothesis as $H_0: \sigma_1^2/\sigma_2^2 = 1$ (that is, $H_0: \sigma_1^2 = \sigma_2^2$). The alternative hypothesis may be

(a) $H_1: \sigma_1^2/\sigma_2^2 > 1$, or (b) $H_1: \sigma_1^2/\sigma_2^2 < 1$, or (c) $H_1: \sigma_1^2/\sigma_2^2 \neq 1$.

(ii) Decide on the significance level $\alpha$.

(iii) The test-statistic to use is

$$F = \frac{s_1^2}{s_2^2}, \text{ where } s_1^2 \text{ is larger than } s_2^2,$$

which, if $H_0$ is true, has an F-distribution with $v_1$ and $v_2$ degrees of freedom.

(iv) Calculate the value of F from the sample data.

(v) Determine the critical region of size $\alpha$ from the right tail of the F-distribution with $v_1$ and $v_2$ degrees of freedom.

(a) When $H_1$ is $\sigma_1^2/\sigma_2^2 > 1$ (i.e. $H_1: \sigma_1^2 > \sigma_2^2$), the critical region will be $F \geq F_\alpha(v_1, v_2)$.

(b) When $H_1$ is $\sigma_1^2/\sigma_2^2 < 1$ (i.e. $H_1: \sigma_1^2 < \sigma_2^2$), we interchange the role of the two samples and use $F = s_2^2/s_1^2$; then the critical region will be $F \geq F_\alpha(v_2, v_1)$.

(c) When $H_1$ is $\sigma_1^2/\sigma_2^2 \neq 1$ (i.e. $H_1: \sigma_1^2 \neq \sigma_2^2$), the critical region will be $F \geq F_{\alpha/2}(v_1, v_2)$ when $s_1^2 > s_2^2$, or $F \geq F_{\alpha/2}(v_2, v_1)$ when $s_2^2 > s_1^2$ (using $F = s_2^2/s_1^2$).




This procedure avoids the use of the left-hand tail. However, if one wishes to use the left-hand tail (two-sided test) also, then the critical region will be $F > F_{\alpha/2}(v_1, v_2)$ and $F < 1/F_{\alpha/2}(v_2, v_1)$.

(vi) Decide as below:

Reject $H_0$ if the computed value of F falls in the critical region, accept $H_0$ otherwise.

Example 19.2. Given two random samples of size $n_1=12$ and $n_2=10$ from two independent normal populations, with $s_1=2.3$ and $s_2=1.5$, test at 0.05 level of significance the hypothesis $H_0: \sigma_1^2/\sigma_2^2 = 1$ against the alternative $H_1: \sigma_1^2/\sigma_2^2 > 1$.

(i) We state our null and alternative hypotheses as

$H_0: \sigma_1^2/\sigma_2^2 = 1$ (that is, $H_0: \sigma_1^2 = \sigma_2^2$), and

$H_1: \sigma_1^2/\sigma_2^2 > 1$ (that is, $H_1: \sigma_1^2 > \sigma_2^2$).

(ii) The level of significance is set at $\alpha=0.05$.

(iii) The test-statistic to use is

$$F = \frac{s_1^2}{s_2^2}, \text{ where } s_1^2 \text{ is larger than } s_2^2,$$

which, if $H_0$ is true, has an F-distribution with $v_1=11$ and $v_2=9$ degrees of freedom.

(iv) Computations. Substituting the values, we get

$$F = \frac{(2.3)^2}{(1.5)^2} = \frac{5.29}{2.25} = 2.35.$$

(v) The critical region is $F > F_{0.05}(11, 9) = 3.10$.

(vi) Conclusion. Since the computed value of F does not fall in the critical region, we do not reject $H_0$ at $\alpha=0.05$ and may conclude that there is not sufficient evidence to indicate that $\sigma_1^2$ is greater than $\sigma_2^2$.
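The right-tail procedure used in Example 19.2 can be sketched in Python as below. The function `variance_ratio_test` and its `alternative` argument are illustrative names, not part of any library, and the critical value must still be read from the F-table for the degrees of freedom that match the ratio actually formed.

```python
def variance_ratio_test(s1_sq, s2_sq, critical_value, alternative="two-sided"):
    """Right-tail variance-ratio test; returns (F, reject_H0).

    critical_value is the tabulated point matching the ratio formed:
    F_alpha(v1, v2) for "greater", F_alpha(v2, v1) for "less", and
    F_{alpha/2} (larger variance's d.f. first) for "two-sided".
    """
    if alternative == "greater":        # H1: sigma1^2 > sigma2^2
        f = s1_sq / s2_sq
    elif alternative == "less":         # H1: sigma1^2 < sigma2^2
        f = s2_sq / s1_sq
    elif alternative == "two-sided":    # H1: sigma1^2 != sigma2^2
        f = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return f, f >= critical_value


# Example 19.2: s1^2 = 5.29, s2^2 = 2.25, F_0.05(11, 9) = 3.10
f_value, reject_h0 = variance_ratio_test(5.29, 2.25, 3.10, "greater")
print(round(f_value, 2), reject_h0)
```

Keeping the critical value as an input, rather than computing it, mirrors the table-based procedure of this section.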
Example 19.3. Two random samples drawn from two normal populations are:

Sample I: 20, 16, 26, 27, 23, 22, 18, 24, 25 and 19.
Sample II: 27, 33, 42, 35, 32, 34, 38, 28, 41, 43, 30, and 37.

Obtain the estimates of variances of the populations and test whether the two populations have the same variance.

(P.U., B.A./B.Sc. 1974)

(i) We state our null and alternative hypotheses as

$H_0: \sigma_1^2/\sigma_2^2 = 1$ (that is, $H_0: \sigma_1^2 = \sigma_2^2$), and

$H_1: \sigma_1^2/\sigma_2^2 \neq 1$ (that is, $H_1: \sigma_1^2 \neq \sigma_2^2$).

(ii) We choose the level of significance at $\alpha=0.05$.

(iii) The test-statistic to use is

$$F = \frac{s_1^2}{s_2^2} \qquad (s_1^2 > s_2^2),$$

which, if $H_0$ is true, has an F-distribution with $v_1$ and $v_2$ d.f.

(iv) Computations. The two sums of squares are

$$\sum (X_{1i} - \bar{X}_1)^2 = \sum X_{1i}^2 - \frac{(\sum X_{1i})^2}{n_1} = 4960 - \frac{(220)^2}{10} = 4960 - 4840 = 120, \text{ and}$$

$$\sum (X_{2i} - \bar{X}_2)^2 = \sum X_{2i}^2 - \frac{(\sum X_{2i})^2}{n_2} = 15014 - \frac{(420)^2}{12} = 15014 - 14700 = 314.$$

Now we find the two estimates as

$$s_1^2 = \frac{\sum (X_{1i} - \bar{X}_1)^2}{n_1 - 1} = \frac{120}{9} = 13.33, \text{ and}$$

$$s_2^2 = \frac{\sum (X_{2i} - \bar{X}_2)^2}{n_2 - 1} = \frac{314}{11} = 28.55.$$

Since $s_2^2$ is larger than $s_1^2$, we therefore interchange the roles of the two samples and use the test statistic $F = s_2^2/s_1^2$. Substituting the values, we get

$$F = \frac{28.55}{13.33} = 2.14.$$




(v) The critical region is $F > F_{0.025}(11, 9) = 3.92$ ($\because v_1=11, v_2=9$).

(vi) Conclusion. Since the computed value of F does not fall in the critical region, we do not reject $H_0$ and may conclude that the two populations have the same variance.

Alternatively. If we wish to use a two-sided test, then the critical region will be

$$F \geq F_{0.025}(9, 11) = 3.59, \text{ and } F \leq \frac{1}{F_{0.025}(11, 9)} = \frac{1}{3.92} = 0.26.$$

Now $$F = \frac{s_1^2}{s_2^2} = \frac{13.33}{28.55} = 0.47.$$

Decision. The computed value F=0.47 falls in the acceptance region, so we do not reject $H_0$ and conclude that the two populations have equal variances.
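The computations of Example 19.3 can be checked directly from the raw observations. The sketch below uses Python's standard `statistics.variance`, which divides by n - 1 and therefore gives the unbiased estimates; the critical value 3.92 is the tabulated $F_{0.025}(11, 9)$ quoted in the example, and the variable names are illustrative.

```python
from statistics import variance

sample_1 = [20, 16, 26, 27, 23, 22, 18, 24, 25, 19]
sample_2 = [27, 33, 42, 35, 32, 34, 38, 28, 41, 43, 30, 37]

s1_sq = variance(sample_1)   # unbiased estimate, divisor n1 - 1
s2_sq = variance(sample_2)   # unbiased estimate, divisor n2 - 1

# Larger variance in the numerator, as in step (iv) of the example.
f_ratio = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
critical = 3.92              # F_0.025(11, 9), as quoted in the example
reject_h0 = f_ratio >= critical
print(round(s1_sq, 2), round(s2_sq, 2), round(f_ratio, 2), reject_h0)
```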


Example 19.4. In an experiment on reaction times in seconds of two individuals A and B, measured under identical conditions, the following results were obtained:

A: 0.41, 0.38, 0.37, 0.42, 0.35, 0.38
B: 0.32, 0.36, 0.38, 0.33, 0.38

(a) Test the hypothesis at 0.05 level of significance that $H_0: \sigma_A^2 = \sigma_B^2$ against $H_1: \sigma_A^2 \neq \sigma_B^2$.

(b) If $H_0: \sigma_A^2 = \sigma_B^2$ is accepted in part (a), then test the hypothesis at 0.05 level of significance that $H_0: \mu_A = \mu_B$ against $H_1: \mu_A \neq \mu_B$.

Assuming the reaction times as two random samples from two normal populations with means $\mu_A$ and $\mu_B$ and variances $\sigma_A^2$ and $\sigma_B^2$, we perform the tests as below:


(a) (i) We state our null and alternative hypotheses as

$H_0: \sigma_A^2 = \sigma_B^2$ and $H_1: \sigma_A^2 \neq \sigma_B^2$.

(ii) The level of significance is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$F = \frac{s_A^2}{s_B^2},$$

which, if $H_0$ is true, follows an F-distribution with $v_1=5$ and $v_2=4$ d.f.

(iv) Computations:

$$s_A^2 = \frac{1}{n_A - 1}\left[\sum A^2 - \frac{(\sum A)^2}{n_A}\right] = \frac{1}{5}\left[0.8927 - \frac{(2.31)^2}{6}\right] = 0.00067$$

$$s_B^2 = \frac{1}{n_B - 1}\left[\sum B^2 - \frac{(\sum B)^2}{n_B}\right] = \frac{1}{4}\left[0.6297 - \frac{(1.77)^2}{5}\right] = 0.00078$$

$$F = \frac{s_A^2}{s_B^2} = \frac{0.00067}{0.00078} = 0.86$$

(v) The critical region is $F > F_{0.025}(5, 4) = 9.36$, and

$$F < \frac{1}{F_{0.025}(4, 5)} = \frac{1}{7.39} = 0.14.$$

(vi) Conclusion. Since the calculated value F = 0.86 does not fall in the critical region, we accept $H_0$ and conclude that the population variances are equal.


(b) Since the hypothesis $H_0: \sigma_A^2 = \sigma_B^2$ is accepted, it would, therefore, now be appropriate to test $H_0: \mu_A = \mu_B$, using the two-sample t-test.

(i) We state our hypotheses as

$H_0: \mu_A = \mu_B$ against $H_1: \mu_A \neq \mu_B$.

(ii) The level of significance is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$t = \frac{\bar{A} - \bar{B}}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$

which, if $H_0$ is true, follows a t-distribution with $v = 9$ d.f.

(iv) Computations: To calculate t, we need to compute $s_p$, which is

$$s_p = \sqrt{\frac{(n_1 - 1)s_A^2 + (n_2 - 1)s_B^2}{n_1 + n_2 - 2}} = \sqrt{\frac{0.00335 + 0.00312}{6 + 5 - 2}} = 0.0268$$

$$t = \frac{0.385 - 0.354}{0.0268\sqrt{1/6 + 1/5}} = \frac{0.031}{0.016} = 1.94$$

(v) The critical region is $|t| > t_{0.025(9)} = 2.26$.

(vi) Conclusion: Since the calculated value t = 1.94 does not fall in the critical region, we accept $H_0$ and conclude that the population means are equal.
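Part (b) of Example 19.4 can be recomputed from the raw reaction times with the standard library. This is a sketch under the example's own assumptions (normal populations, equal variances); the variable names are illustrative and the critical value 2.26 is the tabulated $t_{0.025(9)}$ quoted above. Without rounding the denominator to 0.016, t comes out at about 1.91 rather than 1.94, and the conclusion is the same.

```python
from math import sqrt
from statistics import mean, variance

a = [0.41, 0.38, 0.37, 0.42, 0.35, 0.38]
b = [0.32, 0.36, 0.38, 0.33, 0.38]
n1, n2 = len(a), len(b)

# Pooled variance: weighted average of the two unbiased sample variances.
pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
s_p = sqrt(pooled_var)

# Two-sample t statistic with n1 + n2 - 2 = 9 degrees of freedom.
t = (mean(a) - mean(b)) / (s_p * sqrt(1 / n1 + 1 / n2))
reject_h0 = abs(t) > 2.26    # t_0.025(9), as quoted in the example
print(round(s_p, 4), round(t, 2), reject_h0)
```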


EXERCISES

19.1 Define the F-statistic and F-distribution. Mention what assumptions we make in using the F-distribution. What hypotheses do we test with an F-distribution? Discuss them briefly. (P.U., B.A/B.Sc. 1991)

19.2 (a) Define the variance-ratio or the F-distribution and sketch out its derivation.

(b) Describe some of the important properties of the F-distribution.

19.3 (a) Find the mean and the variance of an F-random variable with $v_1$ and $v_2$ degrees of freedom.

(b) Let F have an F-distribution with parameters $v_1$ and $v_2$. Then prove that 1/F has an F-distribution with parameters $v_2$ and $v_1$.

19.4 (a) Given two random samples from two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, explain how you will find the confidence interval for $\sigma_1^2/\sigma_2^2$.

(b) Given $n_1=n_2=16$, $s_1^2=50$, and $s_2^2=16$, construct a 99% confidence interval for the variance ratio $\sigma_1^2/\sigma_2^2$.

(c) Given $n_1=41$, $n_2=13$, $s_1^2=15.6$, and $s_2^2=6.3$, construct a 98 per cent confidence interval for $\sigma_1^2/\sigma_2^2$.

19.5 (a) Under what conditions is the sampling distribution of $s_1^2/s_2^2$ an F distribution? Explain the relationship between the F and t distributions, and between the F and $\chi^2$ distributions.

(b) Given two random samples of size $n_1=9$ and $n_2=16$ from two independent normal populations, with $s_1=6$ and $s_2=5$, find 98% confidence intervals for $\sigma_1^2/\sigma_2^2$ and $\sigma_1/\sigma_2$.

(c) A random sample of 10 salt-water fish had a variance, $s_1^2$, in girth of 7.2 (inches)², while a random sample of 8 fresh-water fish had a variance, $s_2^2$, in girth of 3.6 (inches)². Find a 90 per cent confidence interval for the ratio between the two variances $\sigma_1^2/\sigma_2^2$. Assume normal populations.


19.6 (a) Describe how you would test the equality of two variances.

(b) Given the following information, what is your conclusion in testing each of the indicated null hypotheses?

         n1     n2     s1²    s2²    α       H0                              H1
(i)      --     --     --     --     --      $\sigma_1^2/\sigma_2^2 = 1$     $\sigma_1^2/\sigma_2^2 > 1$
(ii)     13     41     6      --     0.01    $\sigma_1^2/\sigma_2^2 = 1$     $\sigma_1^2/\sigma_2^2 < 1$
(iii)    60     120    80     170    0.02    $\sigma_1^2 = \sigma_2^2$       $\sigma_1^2 \neq \sigma_2^2$

(-- marks an entry that is not legible in this copy.)

19.7 (a) Two independent random samples of size $n_1=10$ and $n_2=7$ were observed to have sample variances of $s_1^2=16$ and $s_2^2=3$. Using a 10% level of significance, test $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 \neq \sigma_2^2$. Then using a 5% level of significance, test $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 > \sigma_2^2$ and $H_1: \sigma_1^2 < \sigma_2^2$. (P.U., B.A/B.Sc. 1996, I.U., M.Sc. 1990)

(b) Two samples are randomly selected from two classes of students who have been taught by different methods. An examination is given and the results are shown as follows:

                              Class I    Class II
Sample size:                     8          10
Means:                          95          97
Unbiased sample variances:      47          80

Test whether the two different methods of teaching are equally variable. (P.U., B.A/B.Sc. 1988)

19.8 A standardized placement test in mathematics was given to 25 boys and 16 girls. The boys made an average grade of 82 with a standard deviation of 8, while the girls made an average grade of 78 with a standard deviation of 7. Test the hypothesis that $H_0: \sigma_1^2 = \sigma_2^2$ against the alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the populations of grades for all boys and all girls respectively. Use a 0.02 level of significance. (P.U., B.A/B.Sc. 1986)

19.9 Independent random samples were selected from each of two normally distributed populations, $n_1=6$ from population 1 and $n_2=5$ from population 2. The data are shown below:

Sample 1: 3.1, 4.4, 1.2, 1.7, 0.7, 3.4
Sample 2: 2.3, 1.4, 3.7, 8.9, 5.5

Do these data provide sufficient evidence to indicate a difference between the population variances? Use $\alpha=0.05$.

19.10 The following data give the percentage extension under a given load of two independent random samples of yarn, the first before washing, the second after six washings.

Unwashed yarn: 12.3, 13.7, 10.4, 11.4, 14.9, 12.6
Washed yarn: 15.7, 10.3, 12.6, 14.5, 12.6, 18.8, 11.9

Assuming that both samples come from normal distributions, test whether there is a significant difference between the two samples:

(a) as regards variability.
(b) as regards the mean percentage extension.
(I.U., M.Sc. 1988; P.U., B.A/B.Sc. 1993)

19.11 The percent moisture content in a puffed cereal where samples are taken from two different "guns" showed

Gun I: 3.6, 3.8, 3.6, 3.3, 3.7, 3.4
Gun II: 3.7, 3.9, 4.2, 4.2, 4.9, 3.6, 3.5, 4.0

Test the hypothesis of equal variances and equal means. Use any assumptions you believe appropriate. (P.U., B.A/B.Sc. 1991)

19.12 Two methods of determining moisture content of samples of canned corn have been proposed and both have been used to make determinations on proportions taken from each of 21 cans. Method I is easier to apply but appears to be more variable than Method II. If the variability of Method I were not more than 25 per cent greater than that of Method II, we would prefer Method I. Based on the following sample results, which method would you recommend?

$n_1 = n_2 = 21$; $\bar{y}_1 = 50$; $\bar{y}_2 = 53$;

$\sum (y_{1i} - \bar{y}_1)^2 = 120$; $\sum (y_{2i} - \bar{y}_2)^2 = 340$ (P.U., M.Sc. 1970)

Hint: Test $H_0: \sigma_1^2 = 1.25\,\sigma_2^2$ against the alternative $H_1: \sigma_1^2 > 1.25\,\sigma_2^2$. Under the null hypothesis, $s_1^2/(1.25\, s_2^2)$ has an F-distribution with $v_1=v_2=20$. [$\because$ If $H_0: \sigma_1^2/\sigma_2^2 = k$, then $F = (s_1^2/s_2^2)(1/k)$.]


The Analysis of Variance 


20.1 INTRODUCTION 


Earlier, we compared two population means by using a two-sample t-test. However, we are often required to compare more than two population means simultaneously. We might be tempted to apply the two-sample t-test to all possible pairwise comparisons of means. For example, if we wish to compare 4 population means, there will be $\binom{4}{2}=6$ separate pairs, and to test the null hypothesis that all four population means are equal would require six two-sample t-tests. Similarly, to test the null hypothesis that 10 population means are equal, we would need $\binom{10}{2}=45$ separate two-sample t-tests. Running multiple two-sample t-tests in this way for comparing means has two disadvantages. First, the procedure is tedious and time consuming, and secondly, the overall level of significance greatly increases as the number of t-tests increases. Thus a series of two-sample t-tests is not an appropriate procedure to test the equality of several means simultaneously.
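The growth in the number of pairwise tests, and in the overall level of significance, can be illustrated with a short Python sketch. The function names are illustrative. The overall level is computed as $1-(1-\alpha)^m$ for m tests, which assumes the tests are independent; pairwise t-tests on the same samples are not strictly independent, so the figures are only a rough guide.

```python
from math import comb


def pairwise_tests(k):
    """Number of two-sample t-tests needed to compare k means in pairs."""
    return comb(k, 2)


def overall_level(k, alpha=0.05):
    """Chance of at least one false rejection, assuming independent tests."""
    return 1 - (1 - alpha) ** pairwise_tests(k)


for k in (4, 10):
    # Print k, the number of pairs, and the approximate overall level.
    print(k, pairwise_tests(k), round(overall_level(k), 3))
```

With $\alpha = 0.05$ per test, four means already give an overall level above 0.25 under this assumption, and ten means give one above 0.9.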


Evidently, we require a procedure for carrying out a test on several means simultaneously. One such procedure is the analysis of variance, introduced by Sir R.A. Fisher (1890-1962) in 1923. The analysis of variance (abbreviated as ANOVA) is a technique that partitions the total variation (a term distinct from variance and measured by the sum of squares of deviations from the mean) into its component parts, each of which is associated with a different source of variation. These component parts of variance are then analysed (hence the name, analysis of variance) in such a manner that certain hypotheses can be tested. This technique is based on the facts that (i) the more the sample means differ, the larger the variance becomes, and (ii) the separate components provide independent and unbiased estimates of the common population variance. The analysis of variance procedure therefore compares two different estimates of variance by using the F-distribution to determine whether the population means are equal. The analysis of variance has proved to be a most powerful and useful technique whenever the statistical data can be categorised in groups.


When each observation is classified into one sample or another
according to a single criterion, we have a one-way classification, while
the classification of each observation on the basis of two criteria is
called a two-way classification. In a similar way, a multi-way
classification is defined. We discuss the analysis of variance
procedures for the first two classifications only, as a one-way analysis
of variance and a two-way analysis of variance respectively.


20.2 ONE-WAY ANALYSIS OF VARIANCE


The one-way analysis of variance is also called the one-variable-of-
classification analysis of variance. The data are classified into k classes.

Suppose we have k samples of equal size r (the case of unequal
sample sizes will be discussed later), selected randomly and
independently, one from each of k normal populations with means
$\mu_1, \mu_2, \ldots, \mu_k$ and common variance $\sigma^2$; and we wish to test the null
hypothesis that all the k population means are equal, i.e.

$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$,

against the alternative hypothesis

$H_1$: Not all means are equal.

Let $X_{ij}$ denote the ith observation of the jth sample (or treatment).
Then the data can be arranged as in the table below:


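                  Samples (or Treatments)

            1           2         ...       j         ...       k
         $X_{11}$    $X_{12}$     ...    $X_{1j}$     ...    $X_{1k}$
         $X_{21}$    $X_{22}$     ...    $X_{2j}$     ...    $X_{2k}$
           ...         ...                 ...                 ...
         $X_{r1}$    $X_{r2}$     ...    $X_{rj}$     ...    $X_{rk}$
Total    $T_{.1}$    $T_{.2}$     ...    $T_{.j}$     ...    $T_{.k}$     $T_{..}$
Mean     $\bar{X}_{.1}$    $\bar{X}_{.2}$     ...    $\bar{X}_{.j}$     ...    $\bar{X}_{.k}$     $\bar{X}_{..}$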


Here $\bar{X}_{.j}$, $\bar{X}_{..}$, $T_{.j}$ and $T_{..}$ represent the mean of the jth sample, the
overall or grand mean, the total of observations in the jth sample and the
total of all $rk = n$ observations respectively, where the dot replaces the
subscript over which we have summed.

We test the hypothesis by comparing two independent estimates of
the common population variance $\sigma^2$. The estimates of the variance can
be obtained in various ways.


(i) The first estimate of the common population variance $\sigma^2$ is
evidently obtained by pooling the k sample variances. Thus the
pooled estimate of $\sigma^2$, denoted by $s_p^2$, is given by

$$s_p^2 = \frac{(r-1)s_1^2 + (r-1)s_2^2 + \ldots + (r-1)s_k^2}{n-k}, \quad \text{where } n = rk$$

$$= \frac{1}{n-k}\left[\sum_{i=1}^{r}(X_{i1}-\bar{X}_{.1})^2 + \sum_{i=1}^{r}(X_{i2}-\bar{X}_{.2})^2 + \ldots + \sum_{i=1}^{r}(X_{ik}-\bar{X}_{.k})^2\right]$$

Later, this estimate will be referred to as the within samples
estimate of variance $\sigma^2$. This is an unbiased estimate whether or not
the null hypothesis is true.

(ii) The second estimate is based on the variation among the sample
means, assuming that all the population means are equal.
Theoretically, the variance of the mean of a sample of size r is
$\sigma^2/r$. Therefore, using the relation $\sigma_{\bar{X}}^2 = \sigma^2/r$, we get $\sigma^2 = r\,\sigma_{\bar{X}}^2$.
If $s_{\bar{X}}^2$ is an unbiased estimate of $\sigma_{\bar{X}}^2$, then an estimate of $\sigma^2$ will be

$$r\,s_{\bar{X}}^2 = \frac{r\sum_{j=1}^{k}(\bar{X}_{.j}-\bar{X}_{..})^2}{k-1}$$

Let us denote this estimate by $s_b^2$, as later we will call it the
between samples estimate of variance $\sigma^2$. This estimate is independent
of the within samples estimate as it is obtained using the means of the
samples. However, this estimate will become greater than the estimate
obtained by pooling the sample variances when the sample means differ,
i.e. when $H_0$ is not true.

A third estimate, denoted by $s_m^2$, can also be obtained by treating
the data as one large sample consisting of n observations, by the
relation

$$s_m^2 = \frac{1}{n-1}\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{..})^2$$

This is also an unbiased estimate of $\sigma^2$ when $H_0$ is true. This
estimate is not of use in analysing the results but can be used to simplify
the computations.

As $s_p^2$ and $s_b^2$ are independent unbiased estimates of $\sigma^2$ when $H_0$ is
true, they should not then differ greatly. To detect this, i.e. to test the
hypothesis $H_0: \mu_1 = \mu_2 = \ldots = \mu_k$, we use the ratio

$$F = \frac{s_b^2}{s_p^2} = \frac{r\sum_{j=1}^{k}(\bar{X}_{.j}-\bar{X}_{..})^2/(k-1)}{\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j})^2/(n-k)}$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = k-1$ and $\nu_2 = n-k$
degrees of freedom. We will reject $H_0$ when $F > F_\alpha(\nu_1, \nu_2)$ and conclude
that the population means are not equal.


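20.2.1. Partitioning the Sum of Squares. Equal Sample Sizes. The
estimates of the variance $\sigma^2$ may be obtained by partitioning the
variation present in the k samples (of equal size) drawn from the k
populations. The variation of all $n = kr$ observations, measured by the
expression

$$\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{..})^2,$$

is called the total sum of squares.

To partition this total variation, let us construct the following
identity:

$$X_{ij}-\bar{X}_{..} = (X_{ij}-\bar{X}_{.j}) + (\bar{X}_{.j}-\bar{X}_{..})$$

Squaring both sides and summing over all values, we get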

$$\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j})^2 + \sum_{j=1}^{k}\sum_{i=1}^{r}(\bar{X}_{.j}-\bar{X}_{..})^2 + 2\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j})(\bar{X}_{.j}-\bar{X}_{..})$$

The cross-product term vanishes, because for each j

$$\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j}) = \sum_{i=1}^{r}X_{ij} - r\bar{X}_{.j} = \sum_{i=1}^{r}X_{ij} - r\left(\frac{\sum_{i=1}^{r}X_{ij}}{r}\right) = 0$$

The second term may be written as

$$\sum_{j=1}^{k}\sum_{i=1}^{r}(\bar{X}_{.j}-\bar{X}_{..})^2 = r\sum_{j=1}^{k}(\bar{X}_{.j}-\bar{X}_{..})^2$$

because the summand does not have i as a subscript and a factor not
containing i is considered as constant.

Hence we get the following sum of squares identity

$$\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j})^2 + r\sum_{j=1}^{k}(\bar{X}_{.j}-\bar{X}_{..})^2$$

which indicates that the total variation present in the samples can be
partitioned into two parts. The first part is the sum of squares of
deviations of the observations from the sample means and is called the
within (samples) sum of squares. It is also known as the error sum
of squares. The second part is the weighted sum of the squares of
deviations of the sample means from the grand mean and is called the
between sum of squares. We can thus represent the sum of squares
identity symbolically by the equation

Total SS = Within SS + Between SS

It is important to note that this identity holds whether or not the null
hypothesis $H_0: \mu_1 = \mu_2 = \ldots = \mu_k$ is true.

Now, we obtain an unbiased estimate of $\sigma^2$ by dividing the
Between SS by its degrees of freedom, i.e. by $k-1$ as there are k
samples. A second unbiased estimate of $\sigma^2$ is obtained by dividing the
Within SS by an appropriate number of degrees of freedom, which is
$k(r-1)$ or $n-k$ as there are k samples, each containing r observations.
These two quantities are known as mean squares and are denoted by
MSB and MSW or MSE respectively. Therefore, to test the null



hypothesis $H_0: \mu_1 = \mu_2 = \ldots = \mu_k$ against the alternative $H_1$: not all
means are equal, we form the ratio

$$F = \frac{MSB}{MSW} = \frac{\text{estimated variance from Between SS}}{\text{estimated variance from Within SS}}$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = k-1$ and $\nu_2 = n-k$
degrees of freedom. It has already been stated that, when $H_0$ is not true,
MSB will be larger than $\sigma^2$; we will therefore reject $H_0$ at the $\alpha$ level of
significance, if $F > F_\alpha(\nu_1, \nu_2)$.


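Under $H_0$, with $\mu$ denoting the common population mean, the expected
values of the three sums of squares are

$$E\left[\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{..})^2\right] = E[(n-1)s_m^2] = (n-1)\sigma^2, \quad (\because n = rk)$$

$$E\left[\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j})^2\right] = \sum_{j=1}^{k}E\left[\sum_{i=1}^{r}(X_{ij}-\bar{X}_{.j})^2\right] = \sum_{j=1}^{k}(r-1)\sigma^2 = k(r-1)\sigma^2 = (n-k)\sigma^2, \text{ and}$$

$$E\left[r\sum_{j=1}^{k}(\bar{X}_{.j}-\bar{X}_{..})^2\right] = E\left[r\sum_{j=1}^{k}\{(\bar{X}_{.j}-\mu)-(\bar{X}_{..}-\mu)\}^2\right]$$

$$= E\left[r\sum_{j=1}^{k}\{(\bar{X}_{.j}-\mu)^2 + (\bar{X}_{..}-\mu)^2 - 2(\bar{X}_{.j}-\mu)(\bar{X}_{..}-\mu)\}\right]$$

$$= E\left[r\sum_{j=1}^{k}(\bar{X}_{.j}-\mu)^2 + n(\bar{X}_{..}-\mu)^2 - 2n(\bar{X}_{..}-\mu)^2\right]$$

$$= \sum_{j=1}^{k} r\,E(\bar{X}_{.j}-\mu)^2 - n\,E(\bar{X}_{..}-\mu)^2 = \sum_{j=1}^{k} r\,\frac{\sigma^2}{r} - n\,\frac{\sigma^2}{n} = k\sigma^2 - \sigma^2 = (k-1)\sigma^2.$$

Substituting these values in the sum of squares identity, we get

$$(n-1)\sigma^2 = (n-k)\sigma^2 + (k-1)\sigma^2$$

Dividing both sides by $\sigma^2$, we obtain

$$(n-1) = (n-k) + (k-1)$$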
Clearly the total number of degrees of freedom is n—1 as there is only 
one restriction of computing the grand mean. The d.f. for k samples is 
k—1, because the mean of the sample means must equal the grand mean. 
Similarly, the d.f. for within SS is n-k, due to the k restrictions of 
computing the k-sample means. Hence we find that 
Total df = Within df + Between df 
20.2.3. The Analysis of Variance Table. The various sources of
variation, degrees of freedom, the sums of squares and mean squares
associated with the sources are generally shown in a table, called an
analysis of variance table or ANOVA table. This table is used in
testing the hypothesis that the population means are equal. For one-way
analysis of variance, with k samples of r observations each, the analysis
of variance table is shown below:

                         Analysis of Variance Table

Source of          d.f.    Sum of Squares                                       Mean Square                  F
Variation

Between Samples    k-1     Between SS = $r\sum_j(\bar{X}_{.j}-\bar{X}_{..})^2$         MSB = Between SS/(k-1)       F = MSB/MSW

Within Samples     n-k     Within SS = $\sum_j\sum_i(X_{ij}-\bar{X}_{.j})^2$         MSW = Within SS/(n-k)
(Error)

Total              n-1     Total SS = $\sum_j\sum_i(X_{ij}-\bar{X}_{..})^2$


The procedure for testing the hypothesis $H_0: \mu_1 = \mu_2 = \ldots = \mu_k$
using one-way analysis of variance (samples of equal sizes r) is given
below:

(i) Formulate the null and alternative hypotheses as

$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$, and
$H_1$: Not all k means are equal.

(ii) Decide upon a significance level $\alpha$.

(iii) The test-statistic is

$$F = \frac{s_b^2}{s_w^2}$$

where $s_b^2 = \frac{1}{k-1}\left[r\sum_{j=1}^{k}(\bar{x}_{.j}-\bar{x}_{..})^2\right]$, and

$s_w^2 = \frac{1}{n-k}\sum_{j=1}^{k}\sum_{i=1}^{r}(x_{ij}-\bar{x}_{.j})^2$

are the two estimates of the common variance $\sigma^2$, if $H_0$ is true.
The F-statistic, if $H_0$ is true, has an F-distribution with $\nu_1 = k-1$
and $\nu_2 = n-k$ degrees of freedom.

(iv) Compute the necessary sums of squares and enter them in the
analysis of variance table. Also compute the F-ratio.

(v) Determine the critical region, which will consist of all values
greater than or equal to $F_\alpha(k-1, n-k)$.

(vi) Decide as below:

Reject $H_0$ if F falls in the critical region, accept $H_0$ otherwise.
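Steps (iii) to (vi) can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `one_way_anova` and the dictionary keys are illustrative choices, the data are those of Example 20.1, and the critical value 4.26 is taken from F tables instead of being computed.

```python
def one_way_anova(samples):
    """One-way ANOVA for k samples given as lists of numbers.

    Uses the definitional sums of squares; sample sizes may differ.
    """
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    # Between SS: weighted squared deviations of sample means from grand mean.
    between_ss = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    # Within SS: squared deviations of observations from their own sample mean.
    within_ss = sum((x - m) ** 2
                    for s, m in zip(samples, means) for x in s)
    msb = between_ss / (k - 1)
    msw = within_ss / (n - k)
    return {"df": (k - 1, n - k), "between_ss": between_ss,
            "within_ss": within_ss, "msb": msb, "msw": msw, "F": msb / msw}


example_20_1 = [[40, 50, 60, 65], [70, 65, 66, 50], [45, 38, 60, 42]]
result = one_way_anova(example_20_1)
critical_value = 4.26  # F_0.05(2, 9), read from F tables
print(result)
print("reject H0" if result["F"] > critical_value else "do not reject H0")
```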


20.2.4. Alternative Computing Formulas. The computations of
the sums of squares can be simplified as below:

$$\text{Total SS} = \sum_j\sum_i (X_{ij}-\bar{X}_{..})^2 = \sum_j\sum_i (X_{ij}^2 + \bar{X}_{..}^2 - 2X_{ij}\bar{X}_{..})$$

$$= \sum_j\sum_i X_{ij}^2 + n\bar{X}_{..}^2 - 2\bar{X}_{..}\sum_j\sum_i X_{ij} = \sum_j\sum_i X_{ij}^2 - n\bar{X}_{..}^2 = \sum_j\sum_i X_{ij}^2 - \frac{T_{..}^2}{n}$$

$$= \text{sum of squares of all values} - \frac{(\text{sum of all values})^2}{\text{number of values}}$$

$$\text{Between SS} = \sum_j r(\bar{X}_{.j}-\bar{X}_{..})^2 = r\sum_j \bar{X}_{.j}^2 - n\bar{X}_{..}^2 = \sum_j \frac{T_{.j}^2}{r} - \frac{T_{..}^2}{n} \quad (\because n = rk)$$

The Within SS or SSE is usually obtained by subtracting the
Between SS from the Total SS. The term $T_{..}^2/n$ is generally called a
correction factor (abbreviated as CF) as the deviations are taken from
the grand mean. The arithmetic can further be simplified by choosing a
convenient origin as all the SS are independent of origin.
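The computing formulas, and the remark that the sums of squares do not depend on the origin, can be checked numerically with a short Python sketch. The function name is illustrative, the data are those of Example 20.1, and the shift of 50 is an arbitrary choice of origin.

```python
def shortcut_sums_of_squares(samples):
    """Return (CF, Total SS, Between SS, Within SS) from totals and squares."""
    n = sum(len(s) for s in samples)
    grand_total = sum(sum(s) for s in samples)
    cf = grand_total ** 2 / n                                # correction factor
    total_ss = sum(x * x for s in samples for x in s) - cf
    between_ss = sum(sum(s) ** 2 / len(s) for s in samples) - cf
    within_ss = total_ss - between_ss                        # by subtraction
    return cf, total_ss, between_ss, within_ss


data = [[40, 50, 60, 65], [70, 65, 66, 50], [45, 38, 60, 42]]
cf, total_ss, between_ss, within_ss = shortcut_sums_of_squares(data)
print(cf, total_ss, between_ss, within_ss)

# Change of origin: subtract 50 from every observation. The correction
# factor changes, but the three sums of squares do not.
shifted = [[x - 50 for x in s] for s in data]
cf_s, total_s, between_s, within_s = shortcut_sums_of_squares(shifted)
print(cf_s, total_s, between_s, within_s)
```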


Example 20.1. Given the data below, test the hypothesis that the
means of the three populations are equal. Let $\alpha = 0.05$.

Sample 1:   40, 50, 60, 65
Sample 2:   70, 65, 66, 50
Sample 3:   45, 38, 60, 42

(i) We state our null and alternative hypotheses as

$H_0: \mu_1 = \mu_2 = \mu_3$, i.e. all the three means are equal, and

$H_1$: Not all three means are equal.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$F = \frac{s_b^2}{s_w^2}$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = k-1$ and
$\nu_2 = n-k$ degrees of freedom.


(iv) The computations are carried out as below:

            Sample 1          Sample 2          Sample 3          Row sum of
            $X_{i1}$ ($X_{i1}^2$)     $X_{i2}$ ($X_{i2}^2$)     $X_{i3}$ ($X_{i3}^2$)     squares

            40 (1600)         70 (4900)         45 (2025)         8525
            50 (2500)         65 (4225)         38 (1444)         8169
            60 (3600)         66 (4356)         60 (3600)         11556
            65 (4225)         50 (2500)         42 (1764)         8489

$T_{.j}$          215               251               185               $T_{..}$ = 651;  $\sum\sum X_{ij}^2$ = 36739
$T_{.j}^2$         46225             63001             34225             $\sum T_{.j}^2$ = 143451

Correction Factor (C.F.) = $T_{..}^2/n$ = $(651)^2/12$ = 35316.75

Total SS = $\sum\sum X_{ij}^2$ - C.F. = 36739 - 35316.75 = 1422.25

Between SS = $\sum T_{.j}^2/r$ - C.F. = 143451/4 - 35316.75 = 546.00, and

Within SS = Total SS - Between SS = 1422.25 - 546.00 = 876.25.

The Analysis of Variance table is:

Source of            d.f.    Sum of      Mean       Computed
Variation                    Squares     Square     F

Between Samples      2       546.00      273.00     2.80
Within Samples       9       876.25      97.36
Total Variation      11      1422.25

(v) The critical region is $F > F_{0.05}(2, 9) = 4.26$.

(vi) Conclusion. Since the calculated value of F = 2.80 does not fall
in the critical region, we accept our null hypothesis and may
conclude that all the three means are equal.

20.2.5. One-Way Analysis of Variance. Unequal Sample
Sizes. In the preceding sections, we discussed the one-way analysis of
variance for the situation in which the k samples were all of the same
size r. But, generally, the sizes of the samples are not equal. Let the k
random samples be of sizes $r_1, r_2, \ldots, r_k$ respectively with $\sum_{j=1}^{k} r_j = n$. The
sum of squares identity would then be written with a slight modification
as

$$\sum_{j=1}^{k}\sum_{i=1}^{r_j}(X_{ij}-\bar{X}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{r_j}(X_{ij}-\bar{X}_{.j})^2 + \sum_{j=1}^{k} r_j(\bar{X}_{.j}-\bar{X}_{..})^2$$

The formulas for computing the Total SS and Between SS are given
below:

$$\text{Total SS} = \sum_{j=1}^{k}\sum_{i=1}^{r_j} X_{ij}^2 - \frac{T_{..}^2}{n}, \text{ and}$$

$$\text{Between SS} = \sum_{j=1}^{k}\frac{T_{.j}^2}{r_j} - \frac{T_{..}^2}{n}.$$

The Within SS is obtained by subtraction as before. For degrees of
freedom, we replace rk by n, therefore the respective d.f. are n-1, k-1
and n-k. The rest of the analysis is the same.

Example 20.2. Suppose a company makes four kinds of light
bulbs and it is desired to test whether there are any differences in the
durabilities of the bulbs. Random samples of sizes $n_1 = 5$, $n_2 = 10$, $n_3 = 7$
and $n_4 = 5$ are selected and the following results are obtained:

$\bar{X}_1 = 14$, $\bar{X}_2 = 26$, $\bar{X}_3 = 17$, $\bar{X}_4 = 22$, $s_1^2 = 10$, $s_2^2 = 33\tfrac{1}{3}$, $s_3^2 = 28$, $s_4^2 = 54$,

where $s_j^2$ is the estimate of the population variance from the jth sample.
Perform an analysis of variance to determine whether the service lives
of the four kinds of bulbs do not differ from one another at $\alpha = 0.01$.
(P.U., B.A./B.Sc. 1986)

(i) The null and the alternative hypotheses corresponding to the
problem that the service lives of the bulbs do not differ from one
another, are formulated as


$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$, and

$H_1$: Not all four means are equal.

(ii) The significance level is set at $\alpha = 0.01$.

(iii) The test-statistic to use is

$$F = \frac{s_b^2}{s_w^2}$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = k-1$ and
$\nu_2 = n-k$ degrees of freedom.

(iv) Computations. To find the necessary sums of squares, we first
compute the grand (overall) mean $\bar{X}_{..}$ as

$$\bar{X}_{..} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3 + n_4\bar{X}_4}{n_1+n_2+n_3+n_4} = \frac{5(14) + 10(26) + 7(17) + 5(22)}{5+10+7+5} = \frac{559}{27} = 20.7$$

Between SS = $\sum_{j=1}^{4} n_j(\bar{X}_j - \bar{X}_{..})^2$

= $5(14-20.7)^2 + 10(26-20.7)^2 + 7(17-20.7)^2 + 5(22-20.7)^2$

= 224.45 + 280.90 + 95.83 + 8.45 = 609.63, and

Within SS = $(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2 + (n_4-1)s_4^2$

= $4(10) + 9(33\tfrac{1}{3}) + 6(28) + 4(54)$

= 40 + 300 + 168 + 216 = 724

The Analysis of Variance table is:

Source of            d.f.    Sum of      Mean       Computed
Variation                    Squares     Square     F

Between Samples      3       609.63      203.21     6.46
Within Samples       23      724.00      31.48
Total                26      1333.63

(v) The critical region is $F > F_{0.01}(3, 23) = 4.76$.

(vi) Conclusion. Since the computed value of F = 6.46 falls in the
critical region, we reject $H_0$. This implies that the data
present sufficient evidence to indicate that the service lives of the
four kinds of bulbs do differ from one another at the 0.01 level of
significance.
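When only the sample sizes, means and variances are available, as in the light-bulb example, the F-ratio can be computed from these summaries alone. The Python sketch below is illustrative: the function name is a hypothetical choice, and the grand mean is kept unrounded, which happens to give the same figures to two decimals as the hand computation with 20.7.

```python
def anova_from_summary(sizes, means, variances):
    """One-way ANOVA from per-sample sizes, means and unbiased variances."""
    k = len(sizes)
    n = sum(sizes)
    grand_mean = sum(ni * mi for ni, mi in zip(sizes, means)) / n
    between_ss = sum(ni * (mi - grand_mean) ** 2
                     for ni, mi in zip(sizes, means))
    within_ss = sum((ni - 1) * vi for ni, vi in zip(sizes, variances))
    msb = between_ss / (k - 1)
    msw = within_ss / (n - k)
    return between_ss, within_ss, msb / msw


sizes = [5, 10, 7, 5]
means = [14, 26, 17, 22]
variances = [10, 100 / 3, 28, 54]   # 100/3 is the 33 1/3 of the example

between_ss, within_ss, f_ratio = anova_from_summary(sizes, means, variances)
print(round(between_ss, 2), round(within_ss, 2), round(f_ratio, 2))
```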


Example 20.3. The students in 3 classes in an elementary
statistics course obtained total scores as in the table:

8-o'clock: 121, 117, 145, 108, 142, 154, 115, 81, 122, 127, 122.
10-o'clock: 97, 145, 119, 139, 143, 133, 149, 107, 154.
2-o'clock: 134, 89, 108, 88, 146, 153, 130, 144, 125, 111, 87, 162.

Is there a significant difference in the scores received by the
students meeting at different times of day? State completely the
hypothesis you are testing and your conclusions.

(P.U., B.A./B.Sc. Hons., 1961)

(i) The null and alternative hypotheses corresponding to the
problem that there is no significant difference in the scores
received by students meeting at different times of day, are stated
as

$H_0: \mu_1 = \mu_2 = \mu_3$, and

$H_1$: Not all the means are equal.

(ii) Let us choose the level of significance at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$F = \frac{s_b^2}{s_w^2}$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = 2$, $\nu_2 = 29$
degrees of freedom.

(iv) Computations of sums of squares. To make the computational
work easier, we choose our origin at X = 100. Then the
computations are given as follows:


            8-o'clock       10-o'clock      2-o'clock       Row sum of
                                                            squares

            21 (441)        -3 (9)          34 (1156)       1606
            17 (289)        45 (2025)       -11 (121)       2435
            45 (2025)       19 (361)        8 (64)          2450
            8 (64)          39 (1521)       -12 (144)       1729
            42 (1764)       43 (1849)       46 (2116)       5729
            54 (2916)       33 (1089)       53 (2809)       6814
            15 (225)        49 (2401)       30 (900)        3526
            -19 (361)       7 (49)          44 (1936)       2346
            22 (484)        54 (2916)       25 (625)        4025
            27 (729)                        11 (121)        850
            22 (484)                        -13 (169)       653
                                            62 (3844)       3844

$T_{.j}$          254             286             277             $T_{..}$ = 817;  $\sum\sum X_{ij}^2$ = 36007
$T_{.j}^2$         64516           81796           76729

C.F. = $T_{..}^2/n$ = $(817)^2/32$ = 20859

Total SS = $\sum\sum X_{ij}^2$ - C.F. = 36007 - 20859 = 15148

Between SS = $\sum T_{.j}^2/r_j$ - C.F. = (64516/11 + 81796/9 + 76729/12) - 20859

= (5865 + 9088 + 6394) - 20859 = 488

Within SS = Total SS - Between SS = 14660.

The Analysis of Variance table is:

Source of            d.f.    Sum of      Mean       Computed
Variation                    Squares     Square     F

Between 'times'      2       488         244.00     0.48
Within 'times'       29      14660       505.52
Total                31      15148

(v) The critical region is $F > F_{0.05}(2, 29) = 3.33$.

(vi) Conclusion. Since the computed value of F = 0.48 does not fall in
the critical region, we accept $H_0$ and may conclude that there
is no significant difference in the scores received at different
times of day.



Example 20.4. The following table contains the scores obtained
by students in three sections of a statistics class. Apply the Analysis of
Variance technique to test the homogeneity of their achievements.


Score- Sections ` 


(i) The hypotheses corresponding to the problem that the sections
are equal in achievement, are formulated as

$H_0: \mu_1 = \mu_2 = \mu_3$, and
$H_1$: Not all three means are equal.

(ii) The significance level is chosen at $\alpha = 0.05$.

(iii) The test statistic to use is

$$F = \frac{\text{Between Mean Square}}{\text{Within Mean Square}},$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = k-1$, $\nu_2 = n-k$
degrees of freedom.

(iv) Computations of sums of squares. For this purpose, we first
calculate $\sum f x_j$ and $\sum f x_j^2$ (j = 1, 2, 3) for the three sections as:


Section 3 


at 


i | fox. 
0 0 

49 
26 | 312 |3744 
4 68 | 1156 
3 66 | 1452 


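The sums and sums of squares of scores obtained from each section
are tabulated below to facilitate computational work:

Section      No. of Students      Sum of scores
                 ($r_j$)               ($T_{.j}$)

1                46                   602
2                40                   495
3                35                   285

Total            121                  1382          $\sum\sum x^2$ = 19144

Now Total SS = $\sum\sum x^2 - T_{..}^2/n$ = 19144 - $(1382)^2/121$ = 19144 - 15784.50 = 3359.50,

Between SS = $\sum T_{.j}^2/r_j - T_{..}^2/n$ = $(602)^2/46 + (495)^2/40 + (285)^2/35 - (1382)^2/121$

= 16324.69 - 15784.50 = 540.19, and

Within SS is obtained by subtraction as 3359.50 - 540.19 = 2819.31.

The analysis of variance table is therefore:

Source of            d.f.    Sum of      Mean       Computed
Variation                    Squares     Square     F

Between Sections     2       540.19      270.10     11.30
Within Sections      118     2819.31     23.89
Total                120     3359.50

(v) The critical region is $F > F_{0.05}(2, 118) = 3.07$.

(vi) Conclusion. Since the computed value of F = 11.30 falls in the
critical region, we reject $H_0$ and conclude that the three sections
are not equal in achievement.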

20.2.6. Assumptions of One-Way Analysis of Variance. The
one-way analysis of variance test is based on the following assumptions:

(i) The k samples are selected randomly and independently from the
respective populations.

(ii) All the k populations from which the samples are drawn, are
normally distributed with means $\mu_1, \mu_2, \ldots, \mu_k$.

(iii) The normal populations all have equal variances (that is,
$\sigma_1^2 = \sigma_2^2 = \ldots = \sigma_k^2 = \sigma^2$). The technical term for this assumption is
Homoscedasticity.

(iv) The effects are additive. This means that $X_{ij}$, the ith observation
in the jth sample, is made up of three component quantities as
follows:

$$X_{ij} = \mu + \tau_j + \varepsilon_{ij}$$

where $\mu$ is the overall mean, $\tau_j$ is the sample or treatment effect
for the jth population and $\varepsilon_{ij}$ is the random error, usually considered
a normally and independently distributed variable with zero
mean and common variance $\sigma^2$.


In practice, these assumptions must be checked before proceeding
with the "F-test" for the equality of means. Failure of any assumption
will impair the technique. However, investigations have shown that
minor deviations from normality and equality of variances can be
tolerated. Sometimes, an appropriate transformation, such as taking the
square root or logs, is made to satisfy these assumptions.
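The additive model of assumption (iv) can be illustrated by splitting each observation into an estimated overall mean, an estimated treatment effect and a residual. The Python sketch below uses the usual estimates (grand mean, sample mean minus grand mean, observation minus sample mean); the function name is illustrative and the data are those of Example 20.1. It shows the decomposition only and is not a test of the assumptions.

```python
def additive_decomposition(samples):
    """Split each X_ij into mu_hat + tau_hat_j + e_ij."""
    n = sum(len(s) for s in samples)
    mu_hat = sum(sum(s) for s in samples) / n               # overall mean
    effects = [sum(s) / len(s) - mu_hat for s in samples]   # treatment effects
    residuals = [[x - (mu_hat + t) for x in s]              # random errors
                 for s, t in zip(samples, effects)]
    return mu_hat, effects, residuals


data = [[40, 50, 60, 65], [70, 65, 66, 50], [45, 38, 60, 42]]
mu_hat, effects, residuals = additive_decomposition(data)
print(mu_hat, effects)
for j, e in enumerate(residuals, start=1):
    print(f"sample {j} residuals: {e}")
```

The residuals within each sample sum to zero, and their squares add up to the Within SS.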


20.3 TWO-WAY ANALYSIS OF VARIANCE 


When each observation is classified according to two criteria (or
variables) of classification simultaneously, we use the two-way analysis of
variance technique. The classified data are recorded in a table, in which
the columns represent one criterion (or variable) of the classification and
the rows represent the other criterion. If there are c columns and r rows
in the table, then there will be altogether rc cells. Each cell may contain
a single observation or several observations.


There are two basic forms of two-way analysis of variance,
depending upon whether the two variables of classification are
independent or whether they interact. Two variables (or criteria) of
classification are said to interact when they together have an added effect
that they do not have individually. For example, suppose that we classify
salesmen according to, say, age and educational attainment with a view
to determining whether age and education have a significant effect on the
volume of sales. Suppose we observe that the two variables individually
do not produce significant effects but certain age-groups, when combined
with particular educational attainment, produce significant effects. We
then say that there is an interaction between age and educational
attainment. On the other hand, if a particular combination of certain
age-group and educational attainment does not produce any significant
effect, the variables are independent. When variables of classification are
independent, one observation per cell is recorded. In case of interaction,
several observations are made for each cell. We consider both the cases
in the following subsections.


20.3.1. Two-Way Analysis of Variance without Interaction.
Let $X_{ij}$ denote an observation in the ith row and the jth column in a table
consisting of r rows and c columns and containing sample data from
normal populations with means $\mu_{ij}$ and the common variance $\sigma^2$,
classified according to two criteria of classification simultaneously. Let
$\mu_{i.}$ represent the population mean of the ith row and $\mu_{.j}$ that of the jth
column. Denoting the total and mean of c values in the ith row by $T_{i.}$
and $\bar{X}_{i.}$, the total and mean of r values in the jth column by $T_{.j}$ and $\bar{X}_{.j}$,
and the grand total and grand mean by $T_{..}$ and $\bar{X}_{..}$, the results are shown
in the following form:


There are now two null hypotheses, one corresponding to the problem
that all the r row-means are equal; and the other corresponding to the
problem that all the c column-means are equal. The
hypotheses are

$H_0': \mu_{1.} = \mu_{2.} = \ldots = \mu_{r.}$,

$H_0'': \mu_{.1} = \mu_{.2} = \ldots = \mu_{.c}$,

and the alternative hypotheses to be considered, are

$H_1'$: Not all $\mu_{i.}$ are equal,

$H_1''$: Not all $\mu_{.j}$ are equal.


We test these hypotheses by comparing independent estimates of the
common population variance $\sigma^2$. The estimates of the variance can be
obtained by partitioning the total sum of squares into three components
corresponding to the three possible sources of variation, viz. Between
Rows, Between Columns and Within Samples or Error. For this
purpose, we construct the following identity:

$$X_{ij}-\bar{X}_{..} = (\bar{X}_{i.}-\bar{X}_{..}) + (\bar{X}_{.j}-\bar{X}_{..}) + (X_{ij}-\bar{X}_{i.}-\bar{X}_{.j}+\bar{X}_{..})$$

Squaring both sides and summing over all values, we get

$$\sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij}-\bar{X}_{..})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}(\bar{X}_{i.}-\bar{X}_{..})^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}(\bar{X}_{.j}-\bar{X}_{..})^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij}-\bar{X}_{i.}-\bar{X}_{.j}+\bar{X}_{..})^2$$

+ three cross-product terms, which all reduce to zero.

Since $(\bar{X}_{i.}-\bar{X}_{..})^2$ is identically the same for each of the c values in the ith row
and $(\bar{X}_{.j}-\bar{X}_{..})^2$ is the same for each of the r observations in the jth column,
the sum-of-squares identity becomes

$$\sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij}-\bar{X}_{..})^2 = c\sum_{i=1}^{r}(\bar{X}_{i.}-\bar{X}_{..})^2 + r\sum_{j=1}^{c}(\bar{X}_{.j}-\bar{X}_{..})^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij}-\bar{X}_{i.}-\bar{X}_{.j}+\bar{X}_{..})^2, \text{ i.e.}$$

Total SS = Between Row-means SS + Between Column-means SS
+ Within or Error SS.

We may write this identity briefly in the following symbolic form

SST = SSR + SSC + SSE.


Here we get four estimates of the population variance $\sigma^2$. The first
estimate is based on r-1 d.f. and is given by

$$s_1^2 = \frac{SSR}{r-1}$$

When the hypothesis that the row means are equal is true, $s_1^2$ is an unbiased
estimate of $\sigma^2$; otherwise, it will have a larger value.

The second estimate, based on c-1 d.f., is given by

$$s_2^2 = \frac{SSC}{c-1}$$

If the hypothesis that the column means are equal is true, $s_2^2$ is an
unbiased estimate of $\sigma^2$; otherwise it will also have a larger value.

The third estimate, based on (r-1)(c-1) d.f., is given by

$$s_3^2 = \frac{SSE}{(r-1)(c-1)}$$

which is an unbiased estimate of $\sigma^2$ whether the
hypotheses are true or false.

It has been shown by a theorem due to Cochran that the two
estimates $s_1^2$ and $s_2^2$, derived from the "Between row-means SS" and the
"Between column-means SS", are independent of $s_3^2$, the estimate derived
from the error sum of squares. Hence to test the hypothesis that the
row-means are equal, we compute the statistic

$$F_1 = \frac{s_1^2}{s_3^2}$$

which, when the hypothesis is true, has an F-distribution with $\nu_1 = r-1$,
$\nu_2 = (r-1)(c-1)$ d.f. We will reject the hypothesis at the $\alpha$ level of
significance, when $F_1 > F_\alpha[(r-1), (r-1)(c-1)]$.

Similarly, to test the null hypothesis that the column-means are
equal, we compute the ratio

$$F_2 = \frac{s_2^2}{s_3^2}$$

which has an F-distribution with $\nu_1 = c-1$, $\nu_2 = (r-1)(c-1)$ d.f., if the null
hypothesis is true. Hence we reject the null hypothesis at the $\alpha$ level of
significance, when

$F_2 > F_\alpha[(c-1), (r-1)(c-1)]$.


It has also been verified that the degrees of freedom associated with
different sums of squares are additive. That is

$$(rc-1) = (r-1) + (c-1) + (r-1)(c-1)$$

In other words,

Total d.f. = Rows d.f. + Columns d.f. + Error d.f.

These results are summarised in the following ANOVA Table:

ANOVA-Table for a Two-Way Analysis of Variance without Interaction


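Source of           d.f.          Sum of Squares                                                     Mean Square                   F
Variation

Between Rows        r-1           SSR = $c\sum_{i=1}^{r}(\bar{X}_{i.}-\bar{X}_{..})^2$                          $s_1^2$ = SSR/(r-1)            $F_1 = s_1^2/s_3^2$

Between Columns     c-1           SSC = $r\sum_{j=1}^{c}(\bar{X}_{.j}-\bar{X}_{..})^2$                          $s_2^2$ = SSC/(c-1)            $F_2 = s_2^2/s_3^2$

Error               (r-1)(c-1)    SSE = $\sum_i\sum_j(X_{ij}-\bar{X}_{i.}-\bar{X}_{.j}+\bar{X}_{..})^2$         $s_3^2$ = SSE/[(r-1)(c-1)]

Total               rc-1          SST = $\sum_i\sum_j(X_{ij}-\bar{X}_{..})^2$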

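In practice, we use the following short-cut methods for computing
the Total SS, Between Row-means SS and Between Column-means SS.

$$\text{Total SS} = \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij}-\bar{X}_{..})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}X_{ij}^2 - \frac{T_{..}^2}{rc},$$

$$SSR = c\sum_{i=1}^{r}(\bar{X}_{i.}-\bar{X}_{..})^2 = \frac{\sum_{i=1}^{r}T_{i.}^2}{c} - \frac{T_{..}^2}{rc}, \text{ and}$$

$$SSC = r\sum_{j=1}^{c}(\bar{X}_{.j}-\bar{X}_{..})^2 = \frac{\sum_{j=1}^{c}T_{.j}^2}{r} - \frac{T_{..}^2}{rc}$$

The Error SS is usually obtained by subtraction.
The procedure for testing the null hypotheses is similar to that used
for one-way analysis of variance.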
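The short-cut formulas for the two-way classification without interaction can be sketched in Python as below. The function name and the 3 x 3 table are hypothetical and chosen only to keep the arithmetic small; they are not data from the text. The Error SS is obtained by subtraction, as described above.

```python
def two_way_anova(table):
    """Two-way ANOVA without interaction; `table` is a list of r rows,
    each holding c observations (one observation per cell)."""
    r, c = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    cf = sum(row_totals) ** 2 / (r * c)                 # correction factor
    sst = sum(x * x for row in table for x in row) - cf
    ssr = sum(t * t for t in row_totals) / c - cf       # between rows
    ssc = sum(t * t for t in col_totals) / r - cf       # between columns
    sse = sst - ssr - ssc                               # error, by subtraction
    s1 = ssr / (r - 1)
    s2 = ssc / (c - 1)
    s3 = sse / ((r - 1) * (c - 1))
    return {"SST": sst, "SSR": ssr, "SSC": ssc, "SSE": sse,
            "F_rows": s1 / s3, "F_cols": s2 / s3}


# Hypothetical 3 x 3 table, for illustration only.
table = [[9, 10, 9],
         [12, 11, 13],
         [10, 9, 12]]
res = two_way_anova(table)
print(res)
```

The two F-ratios are then compared with the tabulated values of F for (r-1), (r-1)(c-1) and (c-1), (r-1)(c-1) degrees of freedom respectively.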




Example 20.5. Four experimenters determine the moisture 
content of samples of a powder, each man taking a sample of six 
consignments. Their assessments are: 


Observers Consignments 


Perform a two-way analysis of variance on these data and discuss
whether there is any significant difference between consignments or
between observers. (P.U., B.A/B.Sc. (Hons.) 1970)


(i) We set up the two null hypotheses corresponding to the problems
that
(a) there is no significant difference between consignments, and
(b) there is no significant difference between observers, as

$H_0: \mu_{.1} = \mu_{.2} = \mu_{.3} = \mu_{.4} = \mu_{.5} = \mu_{.6}$, and
$H_0': \mu_{1.} = \mu_{2.} = \mu_{3.} = \mu_{4.}$

The corresponding alternative hypotheses would be
$H_1$: Not all $\mu_{.j}$ are equal,
$H_1'$: Not all $\mu_{i.}$ are equal.


(ii) We choose the level of significance at $\alpha$ = 0.05.
(iii) The test-statistics to use are

$$F_1 = \frac{\text{estimated variance from "Between Consignments SS"}}{\text{estimated variance from "Error SS"}} = \frac{s_1^2}{s_3^2}$$

and

$$F_2 = \frac{\text{estimated variance from "Between Observers SS"}}{\text{estimated variance from "Error SS"}} = \frac{s_2^2}{s_3^2},$$

which have F-distributions with $\nu_1 = 5$, $\nu_2 = 15$ and $\nu_1 = 3$, $\nu_2 = 15$
d.f. respectively, when the null hypotheses are true.


(iv) Computations. The necessary sums of squares are computed as
shown in the table below:

Consignments (Figures in brackets are the squares of $X_{ij}$)

Between Consignments SS = $\sum_j \frac{T_{.j}^2}{r} - \frac{T_{..}^2}{rc}$ = … - 2795.04 = 9.71,

Between Observers SS = $\sum_i \frac{T_{i.}^2}{c} - \frac{T_{..}^2}{rc}$ = … - 2795.04 = 13.13, and

Error SS = 35.96 - (9.71 + 13.13) = 13.12




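The ANOVA-Table is

Source of Variation      d.f.   Sum of Squares   Mean Square     F
Between Consignments       5         9.71            1.94       2.23
Between Observers          3        13.13            4.38       5.03
Error                     15        13.12            0.87
Total                     23        35.96

(vi) Conclusion. Since the computed value of $F_1$ = 2.23 does not fall
in the critical region but the computed value of $F_2$ = 5.03 falls in
the critical region, we accept the hypothesis of no difference between
consignments and reject the hypothesis of no difference between observers.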
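The mean squares and variance ratios of Example 20.5 follow directly from the sums of squares and their degrees of freedom. The short Python sketch below repeats that arithmetic, rounding the mean squares to two decimals before forming the ratios, which appears to be how the printed values 2.23 and 5.03 arise. The critical value 2.90 for (5, 15) d.f. is the one quoted in Example 20.7; the value 3.29 for (3, 15) d.f. is taken from a standard F-table and is an assumption here, since this step of the example did not survive in the text.

```python
# Sums of squares and degrees of freedom from Example 20.5.
ss = {"consignments": 9.71, "observers": 13.13, "error": 13.12}
df = {"consignments": 5, "observers": 3, "error": 15}

# Mean squares, rounded to two decimals.
ms = {source: round(ss[source] / df[source], 2) for source in ss}

f1 = round(ms["consignments"] / ms["error"], 2)   # variance ratio for consignments
f2 = round(ms["observers"] / ms["error"], 2)      # variance ratio for observers

# Tabulated 5% points: 2.90 is quoted in Example 20.7; 3.29 is assumed
# from a standard F-table for (3, 15) d.f.
reject_consignments = f1 >= 2.90
reject_observers = f2 >= 3.29

print(ms, f1, f2, reject_consignments, reject_observers)
```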


r rows and c columns, thus having rc cells. Supposing that each cell
contains n observations, the rcn observations can then be displayed as in
the following table, where $X_{ijk}$ will denote the kth observation in the ith
row and the jth column.




The totals of the observations in the ith row, in the jth column, in
the (i, j)th cell and the total of all rcn observations are denoted by $T_{i..}$,
$T_{.j.}$, $T_{ij.}$ and $T_{...}$ respectively.

Similarly $\bar{X}_{i..}$, $\bar{X}_{.j.}$, $\bar{X}_{ij.}$ and $\bar{X}_{...}$ denote the means of the observations
in the ith row, in the jth column, in the (i, j)th cell and the mean of all
rcn observations respectively.

We further assume that the observations in the (i, j)th cell are a
random sample of size n from a normal population with mean $\mu_{ij}$ and
variance $\sigma^2$ and that all rc populations have the same variance $\sigma^2$.

Then there are three hypotheses to be tested, namely

(i) $H_0$ : that the row-means are equal, against
$H_1$ : Not all row-means are equal,
(ii) $H_0'$ : that the column-means are equal, against
$H_1'$ : Not all column-means are equal, and
(iii) $H_0''$ : that there is no interaction, against
$H_1''$ : that the interaction effects are not all equal to zero.


We test these hypotheses by comparing independent estimates of
the common variance $\sigma^2$. The estimates of the variance are obtained by
partitioning the total sum of squares into four components,
corresponding to the four possible sources of variation, viz., Between
Columns, Between Rows, Interaction and an additional part
involving the variation Within cells about the cell means.

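For this purpose, we construct the following identity:

$$(X_{ijk}-\bar{X}_{...}) = (\bar{X}_{.j.}-\bar{X}_{...}) + (\bar{X}_{i..}-\bar{X}_{...}) + (\bar{X}_{ij.}-\bar{X}_{i..}-\bar{X}_{.j.}+\bar{X}_{...}) + (X_{ijk}-\bar{X}_{ij.})$$

Squaring both sides and summing over all values, we get

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(X_{ijk}-\bar{X}_{...})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(\bar{X}_{.j.}-\bar{X}_{...})^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(\bar{X}_{i..}-\bar{X}_{...})^2$$
$$\qquad + \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(\bar{X}_{ij.}-\bar{X}_{i..}-\bar{X}_{.j.}+\bar{X}_{...})^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(X_{ijk}-\bar{X}_{ij.})^2$$

+ 6 cross-product terms that vanish when summed. Hence we are left with, after
simplification,

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(X_{ijk}-\bar{X}_{...})^2 = rn\sum_{j=1}^{c}(\bar{X}_{.j.}-\bar{X}_{...})^2 + cn\sum_{i=1}^{r}(\bar{X}_{i..}-\bar{X}_{...})^2$$
$$\qquad + n\sum_{i=1}^{r}\sum_{j=1}^{c}(\bar{X}_{ij.}-\bar{X}_{i..}-\bar{X}_{.j.}+\bar{X}_{...})^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(X_{ijk}-\bar{X}_{ij.})^2.$$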
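The sum-of-squares identity can be checked numerically. The following Python sketch is an illustration under stated assumptions: the function name `partition_with_interaction` and the small 2 x 2 layout with n = 2 observations per cell are hypothetical. It computes each component from its definition and confirms that the four parts add up to the total.

```python
def partition_with_interaction(data):
    """data[i][j] is the list of n observations in row i and column j."""
    r, c, n = len(data), len(data[0]), len(data[0][0])
    grand = sum(x for row in data for cell in row for x in cell) / (r * c * n)
    row_mean = [sum(x for cell in data[i] for x in cell) / (c * n) for i in range(r)]
    col_mean = [sum(x for i in range(r) for x in data[i][j]) / (r * n) for j in range(c)]
    cell_mean = [[sum(data[i][j]) / n for j in range(c)] for i in range(r)]

    sst = sum((x - grand) ** 2 for row in data for cell in row for x in cell)
    ssc = r * n * sum((m - grand) ** 2 for m in col_mean)       # Between Columns SS
    ssr = c * n * sum((m - grand) ** 2 for m in row_mean)       # Between Rows SS
    ssrc = n * sum(                                             # Interaction SS
        (cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(r) for j in range(c)
    )
    sse = sum(                                                  # Error (within cells) SS
        (x - cell_mean[i][j]) ** 2
        for i in range(r) for j in range(c) for x in data[i][j]
    )
    return sst, ssc, ssr, ssrc, sse


# Hypothetical 2 x 2 layout with two observations per cell.
data = [[[1, 3], [2, 4]], [[5, 7], [9, 11]]]
sst, ssc, ssr, ssrc, sse = partition_with_interaction(data)
print(sst, ssc, ssr, ssrc, sse)
```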


We may write this sum-of-squares identity symbolically as

Total SS = Between Columns SS + Between Rows SS + Interaction SS + Error SS

or more briefly as SST = SSC + SSR + SS(RC) + SSE.

The corresponding breakdown of the degrees of freedom is given by

$$rcn - 1 = (c-1) + (r-1) + (c-1)(r-1) + rc(n-1).$$

Dividing these sums of squares by their corresponding number of
degrees of freedom, we get the variance estimates as

$$s_1^2 = \frac{SSC}{c-1}, \quad s_2^2 = \frac{SSR}{r-1}, \quad s_3^2 = \frac{SS(RC)}{(c-1)(r-1)} \quad \text{and} \quad s_4^2 = \frac{SSE}{rc(n-1)}.$$

All these estimates, if all the three null hypotheses are true, provide
unbiased estimates of $\sigma^2$. Thus the corresponding variance ratios that we
compute, are

$$F_1 = \frac{s_1^2}{s_4^2}, \quad F_2 = \frac{s_2^2}{s_4^2} \quad \text{and} \quad F_3 = \frac{s_3^2}{s_4^2},$$

which have F-distributions with $\nu_1 = c-1$, $\nu_2 = rc(n-1)$; $\nu_1 = r-1$, $\nu_2 = rc(n-1)$
and $\nu_1 = (c-1)(r-1)$, $\nu_2 = rc(n-1)$ d.f. respectively when the null
hypotheses are true.

When interactions are zero, we may pool the Interaction SS with
the Error SS. This increases the number of d.f. for error and hence
increases the precision of further testing.

In practice, the sums of squares are computed by the following
formulas:

$$\text{Total SS} = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}X_{ijk}^2 - \frac{T_{...}^2}{rcn},$$

where $\frac{T_{...}^2}{rcn}$ is the correction factor (C.F.),

$$\text{Between Column-means SS} = \sum_{j=1}^{c}\frac{T_{.j.}^2}{rn} - \text{C.F.},$$

$$\text{Between Row-means SS} = \sum_{i=1}^{r}\frac{T_{i..}^2}{cn} - \text{C.F.},$$

$$\text{Interaction SS} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{T_{ij.}^2}{n} - \sum_{i=1}^{r}\frac{T_{i..}^2}{cn} - \sum_{j=1}^{c}\frac{T_{.j.}^2}{rn} + \text{C.F.},$$

and Error SS, as before, is obtained by subtraction.


Example 20.6. The following data represent the results of 3
questions obtained by 3 students in three subjects:


Subjects 


23 22 


13 
1 18 20 
15 18 


ô 
21 20 
2 16 14 
24 24 
18 17 
3 15 13 21 
12 16 18 


Perform an analysis of variance upon these data and test the
hypothesis that (a) the subjects are of equal difficulty,
(b) the students are of equal ability, and
(c) the students and subjects do not interact.


(i) We state the null and alternative hypotheses as

(a) $H_0$ : The subjects are of equal difficulty, i.e. the difference
between column means is zero;
$H_1$ : Not all subjects are of equal difficulty;

(b) $H_0'$ : The students have equal ability or the row-means are
equal;
$H_1'$ : Not all students have equal ability; and

(c) $H_0''$ : There is no interaction between students and subjects;
$H_1''$ : The interaction effects are not all equal to zero.

(ii) We choose the level of significance at $\alpha$ = 0.05.

(iii) The test statistic to use is the usual variance ratio, i.e.

$$F = \frac{\text{estimated variance from "Between SS"}}{\text{estimated variance from "Error SS"}}$$

which, if the null hypotheses are true, has F-distribution with $\nu_1$,
$\nu_2$ d.f.

(iv) Computations. Construct a table of the totals of the observations in
the (i, j)th cell, i.e., $T_{ij.}$, together with the row and column totals
(figures in brackets are the squares of $T_{ij.}$).

Student      Subject 1     Subject 2     Subject 3     $T_{i..}$
1            46 (2116)     59 (3481)     65 (4225)     170
2            61 (3721)     58 (3364)     57 (3249)     176
3            45 (2025)     46 (2116)     58 (3364)     149
$T_{.j.}$    152           163           180           495

Now correction factor (C.F.) = $\frac{T_{...}^2}{rcn} = \frac{(495)^2}{27}$ = 9075

Total SS = $\sum\sum\sum X_{ijk}^2$ - C.F. = $13^2 + 18^2 + 15^2 + \ldots + 21^2 + 18^2$ - 9075
= 9363 - 9075 = 288.

Between Subjects SS = $\sum_j \frac{T_{.j.}^2}{rn}$ - C.F. = $\frac{82073}{9}$ - 9075 = 44.22

Between Students SS = $\sum_i \frac{T_{i..}^2}{cn}$ - C.F. = $\frac{82077}{9}$ - 9075 = 44.67

Interaction SS = $\sum_i\sum_j \frac{T_{ij.}^2}{n} - \sum_i \frac{T_{i..}^2}{cn} - \sum_j \frac{T_{.j.}^2}{rn}$ + C.F.
= $\frac{27661}{3} - \frac{82077}{9} - \frac{82073}{9}$ + 9075
= 9220.33 - 9119.67 - 9119.22 + 9075.00 = 56.44

Error SS = SST - SSC - SSR - SS(RC)
= 288 - (44.22 + 44.67 + 56.44) = 142.67.

Hence the analysis of variance table is set up as below:

Source of Variation    d.f.   Sum of Squares   Mean Square     F
Between Subjects         2        44.22           22.11       2.79
Between Students         2        44.67           22.34       2.82
Interaction              4        56.44           14.11       1.78
Error                   18       142.67            7.93
Total                   26       288.00

(v) The critical regions are (a) $F_1 \ge F_{0.05}(2, 18)$ = 3.55
(b) $F_2 \ge F_{0.05}(2, 18)$ = 3.55
(c) $F_3 \ge F_{0.05}(4, 18)$ = 2.93


(vi) Conclusion. Since the computed values of $F_1$ = 2.79, $F_2$ = 2.82
and $F_3$ = 1.78 do not fall in the critical regions, we
accept all three hypotheses.
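The computations of Example 20.6 can be retraced from the cell totals alone. The Python sketch below is an illustration, not the authors' procedure: the layout of `cell_totals` (students as rows, subjects as columns) is inferred from the totals 82077 and 82073 used in the example, and the Total SS of 288 is taken from the text because the individual observations are not needed for the other components.

```python
# Cell totals T_ij. (rows = students, columns = subjects), n = 3 per cell.
cell_totals = [[46, 59, 65],
               [61, 58, 57],
               [45, 46, 58]]
r, c, n = 3, 3, 3
total_ss = 288.0                                    # 9363 - 9075, from the text

row_totals = [sum(row) for row in cell_totals]                               # T_i..
col_totals = [sum(row[j] for row in cell_totals) for j in range(c)]          # T_.j.
grand_total = sum(row_totals)                                                # T_...
cf = grand_total ** 2 / (r * c * n)                                          # correction factor

ss_subjects = sum(t * t for t in col_totals) / (r * n) - cf
ss_students = sum(t * t for t in row_totals) / (c * n) - cf
ss_interaction = (sum(t * t for row in cell_totals for t in row) / n
                  - sum(t * t for t in row_totals) / (c * n)
                  - sum(t * t for t in col_totals) / (r * n) + cf)
ss_error = total_ss - ss_subjects - ss_students - ss_interaction

ms_error = ss_error / (r * c * (n - 1))
f_subjects = (ss_subjects / (c - 1)) / ms_error
f_students = (ss_students / (r - 1)) / ms_error
f_interaction = (ss_interaction / ((r - 1) * (c - 1))) / ms_error

print(round(f_subjects, 2), round(f_students, 2), round(f_interaction, 2))
```

None of the three ratios reaches its tabulated 5% point (3.55, 3.55 and 2.93), which agrees with the conclusion of the example.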


20.4 MULTIPLE COMPARISON TESTS 


If the F-test after the analysis of variance rejects our null hypothesis, all
we can conclude is that the k population means are not all equal. This
conclusion might not be sufficient to satisfy the experimenter, who
would rather like to know which means (or sets of means) differ
significantly from each other. For this purpose, several tests based on
different viewpoints have been developed to make comparisons between
pairs of means. Such tests, used as a follow-up to F-tests, are called
Multiple Comparison tests. The commonly used tests include
Fisher's Least Significant Difference (or LSD) test, the Student-
Newman-Keuls Multiple Range test, Duncan's Multiple Range test,
Scheffe's test, Tukey's T-method, etc. We discuss the first four
tests only.


20.4.1 The Least Significant Difference Test. When the null
hypothesis of equal means is rejected by the F-test after the ANOVA, we can
test the significance of differences between means of k samples (or
treatments) by using the ordinary two-sample t-test on every pair of the
$\tfrac{1}{2}k(k-1)$ possible pairs of $\bar{X}_i$ and $\bar{X}_j$ ($i \ne j$) at significance level $\alpha$. But this
procedure involves a large number of decisions. An alternative method of
dealing with such a situation is to compute the smallest difference that
would be judged significant and to compare the absolute values of all
differences of means with it. This smallest difference is called the least
significant difference or LSD and is given by

$$LSD = t_{\alpha/2,(\nu)}\sqrt{\frac{2(MSE)}{r}},$$

where MSE is the Error (Within) Mean Square, r is the size of equal
samples, and $t_{\alpha/2,(\nu)}$ is the value of t at the $\alpha/2$ level taken against the error
degrees of freedom ($\nu$). The test criterion that uses the least significant
difference is called the LSD test. Two means are declared to come from
populations with significantly different means, when the absolute value
of their difference exceeds the LSD.


It is interesting to note that the LSD test is actually a two-sample
t-test. Two means $\bar{X}_i$ and $\bar{X}_j$ ($i \ne j$) differ significantly when the value of t
computed by

$$t = \frac{|\bar{X}_i-\bar{X}_j|}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}$$

is greater than or equal to $t_{\alpha/2}$ for $\nu = n_1 + n_2 - 2$ degrees of freedom.
In other words, two means $\bar{X}_i$ and $\bar{X}_j$ are judged significantly different,

if $\dfrac{|\bar{X}_i-\bar{X}_j|}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} \ge t_{\alpha/2,(\nu)}$

or if $|\bar{X}_i-\bar{X}_j| \ge t_{\alpha/2,(\nu)}\, s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}$

$\qquad\qquad\quad = t_{\alpha/2,(\nu)}\sqrt{\dfrac{2s_p^2}{r}}$, when $n_1 = n_2 = r$, the size of
equal samples. Since in the ANOVA, the Error (Within) Mean Square
provides the pooled unbiased estimate ($s^2$) of the common variance $\sigma^2$,
the relation becomes

$$|\bar{X}_i-\bar{X}_j| \ge t_{\alpha/2,(\nu)}\sqrt{\frac{2(MSE)}{r}}.$$

The LSD was defined by the right hand term.


It is customary to arrange the sample means in ascending order of
magnitude and to draw a line under any pair of adjacent means (or set of
means) that are not significantly different, e.g.,

$\bar{X}_1$   $\bar{X}_2$   $\bar{X}_3$   $\bar{X}_4$   $\bar{X}_5$   $\bar{X}_6$
---------   ---------------

indicates that $\bar{X}_1$ and $\bar{X}_2$, $\bar{X}_3$ and $\bar{X}_4$, etc. are not significantly different. It
also implies that there exists a significant difference between the groups
($\bar{X}_1$, $\bar{X}_2$) and ($\bar{X}_3$, $\bar{X}_4$, $\bar{X}_5$). Since $\bar{X}_6$ is not connected by a line, it differs
significantly. The LSD test is only applied after the null hypothesis is
rejected in the ANOVA.


Example 20.7. Perform the analysis of variance on the following
data and analyse the treatment means using the LSD test with a 0.05
level of significance.


Treatments 


f 


1 


(i) We formulate our null and alternative hypotheses as

$H_0: \mu_{.1} = \mu_{.2} = \mu_{.3} = \mu_{.4} = \mu_{.5} = \mu_{.6}$, i.e. the six treatment
means are equal, and

$H_1$ : Not all six treatment means are equal.

(ii) The significance level is set at $\alpha$ = 0.05.

(iii) We use the F-test to accept or reject $H_0$. If $H_0$ is rejected, the
LSD-test is used to analyse the treatment means.

(iv) The computations are presented in the following ANOVA-Table:

Source of Variation    d.f.   Sum of Squares   Mean Square     F
Between Treatments       5         …              10.37       6.61
Between Blocks           3         …               …
Error                   15         …               1.57

(v) The critical region for treatments is $F_1 \ge F_{0.05}(5, 15)$ = 2.90.

(vi) Conclusion. We reject the null hypothesis of equal treatment
means as the computed value of $F_1$ = 6.61 exceeds the table value
of F = 2.90. As $H_0$ is rejected using the F-test, we apply the
LSD test to find out which means differ from each other.

The Least-Significant Difference is given by the relation

$$LSD = t_{0.025,(15)}\sqrt{\frac{2s^2}{r}}, \text{ where } s^2 = 1.57 \text{ and } r = 4,$$

$$\quad = 2.131\sqrt{\frac{2(1.57)}{4}} = 1.89.$$

Arranging the treatment means ($\bar{X}_j$) in ascending order of magnitude,
and drawing a line under pairs of adjacent means (or sets of means) that
are not significantly different, we have

$\bar{X}_1$     $\bar{X}_6$     $\bar{X}_5$     $\bar{X}_2$     $\bar{X}_3$     $\bar{X}_4$
1.75   1.75   3.50   4.00   4.75   5.75
------------------
              ------------------
                     ------------------

The significantly different pairs are immediately observed from this
presentation.
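The LSD comparison of Example 20.7 can be written out as a short Python sketch. It is illustrative only: the dictionary `means` pairs each treatment number with its mean in the order used later in the Duncan comparisons, and the tabulated value $t_{0.025}(15)$ = 2.131 is entered by hand rather than computed.

```python
import math
from itertools import combinations

# Treatment means from Example 20.7 (treatment number -> mean).
means = {1: 1.75, 2: 4.00, 3: 4.75, 4: 5.75, 5: 3.50, 6: 1.75}
mse, r = 1.57, 4          # error mean square and observations per treatment
t_crit = 2.131            # t at the 0.025 level for 15 d.f., from the t-table

lsd = t_crit * math.sqrt(2 * mse / r)

# A pair is declared significantly different when |difference| reaches the LSD.
significant = [
    (i, j) for i, j in combinations(sorted(means), 2)
    if abs(means[i] - means[j]) >= lsd
]

print(round(lsd, 2))
print(significant)
```

Every pair not listed in `significant` is joined by one of the lines drawn under the ordered means.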


20.4.2. The Student-Newman-Keuls Multiple Range Test or
S-N-K Test. The S-N-K multiple range test compares the observed range
of sample means in the subset with the calculated critical range at the $\alpha$
per cent level of significance. The test procedure uses a modification of
the Studentized range statistic q defined by

$$q = \frac{\bar{X}_{max}-\bar{X}_{min}}{\sqrt{s^2/r}} = \frac{\text{Range}}{\text{S.E. of } \bar{X}},$$

where $s^2$ (the error mean square in the ANOVA table) is the unbiased
estimate of $\sigma^2$, and r is the number of observations in each mean. The
sampling distribution of the q statistic is approximated by the
Studentized range distribution having parameters k, the number of
sample means, and $\nu$, the number of degrees of freedom for MSE.

The critical value for the difference between two means which are
p (p = k, k-1, ..., 3, 2) steps apart on an ordered scale, is

$$W_p = q_\alpha(p, \nu)\sqrt{s^2/r}.$$

The values of $q_\alpha$ (a modified q statistic in the S-N-K test) for appropriate
values of p and $\nu$ are obtained from the Studentized range table of
significant ranges at the $\alpha$ level desired. It is interesting to note that

$$W_2 = q_\alpha(2, \nu)\sqrt{s^2/r}, \quad (p = 2)$$
$$\quad\; = t_{\alpha/2}\sqrt{2s^2/r} = LSD.$$

To carry out the test, arrange the k sample means in order of
magnitude so that $\bar{X}_1 < \bar{X}_2 < \ldots < \bar{X}_k$ and calculate the critical range for
p = k, i.e. $W_k$, by multiplying the studentized range q by $\sqrt{s^2/r}$ (as found
in an ANOVA). Compare the observed range of sample means $\bar{X}_k - \bar{X}_1$
with $W_k$. If $\bar{X}_k - \bar{X}_1$ does not exceed $W_k$, conclude that the means are
equal and the test ends there. If $\bar{X}_k - \bar{X}_1$ exceeds $W_k$, divide the
sample means into two groups of k-1 means each.
Compare the two ranges $\bar{X}_k - \bar{X}_2$ and $\bar{X}_{k-1} - \bar{X}_1$ with the critical range
for p = k-1, i.e. $W_{k-1}$. If either range does not exceed $W_{k-1}$, declare
the means in that group to be equal. In case either range exceeds $W_{k-1}$,
divide the means in the group concerned into two groups of k-2 means
each and compare with the calculated $W_{k-2}$, and so on until a group of
means is found whose range does not exceed the critical value $W_p$. The
process ends whenever the observed range of a group does not exceed
the calculated critical range. It is also called a sequential
or stepwise procedure and is perhaps the best.
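The remark that the critical range for p = 2 equals the LSD can be checked numerically, because the Studentized range point for two means satisfies $q_\alpha(2, \nu) = \sqrt{2}\,t_{\alpha/2}(\nu)$. The Python sketch below is an illustration that borrows MSE = 1.57, r = 4 and $t_{0.025}(15)$ = 2.131 from Example 20.7; the helper name `critical_range` is hypothetical.

```python
import math


def critical_range(q, mse, r):
    """Critical range W_p = q * sqrt(s^2 / r) for a tabulated value q."""
    return q * math.sqrt(mse / r)


mse, r = 1.57, 4
t_crit = 2.131                      # t at the 0.025 level for 15 d.f.

q2 = math.sqrt(2) * t_crit          # Studentized range point for p = 2
w2 = critical_range(q2, mse, r)
lsd = t_crit * math.sqrt(2 * mse / r)

print(round(q2, 2), round(w2, 2), round(lsd, 2))
```

For p greater than 2 the values of $q_\alpha(p, \nu)$ have to be read from the Studentized range table; they are not reproduced here.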


20.4.3. The Duncan's Multiple Range Test. The $\alpha$ per cent critical
ranges are called the least significant ranges (lsr) and are denoted by $R_p$,
p = 2, 3, ..., k-1, k. They are obtained by multiplying the values
$q_\alpha(p, \nu)$, taken from Duncan's table of significant ranges, by $\sqrt{MSE/r}$.

It is of importance to note that in case of unequal sample sizes,
the harmonic mean of all $r_i$'s, i.e.

$$r_h = \frac{k}{\sum_{i}(1/r_i)},$$

is used in place of r, the number of observations in equal
sample sizes.

Example 20.8. Use Duncan's multiple range test for the data in
Example 20.7 to compare all pairs of treatment means. Assume that
$\alpha$ = 0.05.

In Example 20.7, we found that

$s^2$ = MSE = 1.57,

r (the number of observations in each treatment) = 4,

$\nu$ (d.f. for MSE) = 15

and k (number of treatments) = 6.

We now find

$$\sqrt{MSE/r} = \sqrt{1.57/4} = 0.6265.$$

Arranging the treatment means ($\bar{X}_j$) in increasing order of
magnitude, we get

$\bar{X}_1$     $\bar{X}_6$     $\bar{X}_5$     $\bar{X}_2$     $\bar{X}_3$     $\bar{X}_4$
1.75   1.75   3.50   4.00   4.75   5.75

The values of $q_{0.05}(p, 15)$ for p = 2, 3, 4, 5, 6 taken from Duncan's table
of significant ranges and the least significant ranges, $R_p$, obtained by
multiplying $q_{0.05}(p, 15)$ by $\sqrt{MSE/r}$, are shown below:

p                  2      3      4      5      6
$q_{0.05}(p, 15)$      3.01   3.16   3.25   3.31   3.36
$R_p$                1.89   1.98   2.04   2.07   2.11

Comparing the differences between all pairs of means with the least
significant ranges, $R_p$, beginning with the largest ($\bar{X}_4$) against the
smallest ($\bar{X}_1$), we have the following results:


4 versus 1 : 5.75 - 1.75 = 4.00 > 2.11 ($R_6$)
4 versus 6 : 5.75 - 1.75 = 4.00 > 2.07 ($R_5$)
4 versus 5 : 5.75 - 3.50 = 2.25 > 2.04 ($R_4$)
4 versus 2 : 5.75 - 4.00 = 1.75 < 1.98 ($R_3$)
4 versus 3 : 5.75 - 4.75 = 1.00 < 1.89 ($R_2$)
3 versus 1 : 4.75 - 1.75 = 3.00 > 2.07 ($R_5$)
3 versus 6 : 4.75 - 1.75 = 3.00 > 2.04 ($R_4$)
3 versus 5 : 4.75 - 3.50 = 1.25 < 1.98 ($R_3$)
3 versus 2 : 4.75 - 4.00 = 0.75 < 1.89 ($R_2$)
2 versus 1 : 4.00 - 1.75 = 2.25 > 2.04 ($R_4$)
2 versus 6 : 4.00 - 1.75 = 2.25 > 1.98 ($R_3$)
2 versus 5 : 4.00 - 3.50 = 0.50 < 1.89 ($R_2$)
5 versus 1 : 3.50 - 1.75 = 1.75 < 1.98 ($R_3$)
5 versus 6 : 3.50 - 1.75 = 1.75 < 1.89 ($R_2$)
6 versus 1 : 1.75 - 1.75 = 0 < 1.89 ($R_2$)
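The list of comparisons follows one rule: two means that are p steps apart in the ordered arrangement are compared with $R_p$. A minimal Python sketch of that rule is given below. It is a simplification for illustration, since it checks every pair directly and does not add any further protection rule, and the variable names are hypothetical.

```python
# Treatments in increasing order of their means, as arranged in Example 20.8.
order = [1, 6, 5, 2, 3, 4]
means = [1.75, 1.75, 3.50, 4.00, 4.75, 5.75]

# Least significant ranges R_p for p = 2, ..., 6 from the example.
lsr = {2: 1.89, 3: 1.98, 4: 2.04, 5: 2.07, 6: 2.11}

significant = []
for hi in range(len(order) - 1, 0, -1):        # start with the largest mean
    for lo in range(hi):                       # against the smallest first
        p = hi - lo + 1                        # number of means spanned
        diff = means[hi] - means[lo]
        if diff > lsr[p]:
            significant.append((order[hi], order[lo]))

print(significant)
```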
The pairs of means whose differences are greater than the corresponding
least significant ranges, $R_p$, are significantly different. Drawing a line
under the means that are not significantly different, we have

$\bar{X}_1$     $\bar{X}_6$     $\bar{X}_5$     $\bar{X}_2$     $\bar{X}_3$     $\bar{X}_4$
1.75   1.75   3.50   4.00   4.75   5.75
------------------
              ------------------
                     ------------------

From the analysis, we see that the LSD method and Duncan's multiple
range test produce the same conclusions.

20.4.4. The Scheffe's Method. A contrast is defined as a
linear combination of means with an essential condition that the sum of
their constant co-efficients must equal zero. For example, the difference
between two means $\mu_i - \mu_j$ is a contrast, as its co-efficients 1 and -1
sum to zero. It is a common practice to define a contrast as a linear function

$$Q = c_1T_1 + c_2T_2 + \ldots + c_kT_k, \text{ where } c_1 + c_2 + \ldots + c_k = 0,$$

or more compactly, $Q = \sum c_iT_i$, where $\sum c_i = 0$.

Two contrasts $\sum c_iT_i$ and $\sum c_i'T_i$ will be orthogonal or independent if
the sum of the products of the corresponding co-efficients is equal to zero,
i.e. $\sum c_ic_i' = 0$. Non-orthogonal contrasts are sometimes used though
orthogonal contrasts are better.

The sum of squares for a contrast, Q is given by

$$SSQ = \frac{Q^2}{r\sum c_i^2},$$

where r is the number of observations in each $T_i$. It is important to
note that the number of orthogonal contrasts should not exceed the number
of degrees of freedom for treatment means. Furthermore, a contrast has one
degree of freedom and a t-test is …

Scheffe's method is used to test the significance of all possible
contrasts. This method requires the computation of

(i) $S = \sqrt{(k-1)F_\alpha(\nu_1, \nu_2)}$, ($\nu_1 = k-1$),
where $\nu_1$ and $\nu_2$ are treatment and error degrees of freedom and
F is the tabulated value for a significance level $\alpha$ in the F-test of
$H_0: \tau_1 = \tau_2 = \ldots = \tau_k$,

(ii) the standard error of each contrast to be tested as

$$s_Q = \sqrt{(MSE)\sum c_i^2/r_i}, \text{ and then}$$

(iii) the critical value as
Scheffe's value = $S(s_Q)$.

A contrast, Q is said to differ significantly from zero when the
absolute value of Q exceeds $S(s_Q)$, i.e., Scheffe's critical value. We can
also directly test the hypothesis $H_0: \mu_i = \mu_j$ by the criterion

$$|\bar{X}_i-\bar{X}_j| > \sqrt{(k-1)F_\alpha}\,\sqrt{(MSE)\sum c_i^2/r_i}.$$
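As an illustration of the three steps, the following Python sketch applies Scheffe's criterion to a pairwise contrast using the figures of Example 20.7 (k = 6, MSE = 1.57, r = 4 and the tabulated $F_{0.05}(5, 15)$ = 2.90). The function name `contrast_se` and the choice of contrast are assumptions made for the sketch, not part of the original example.

```python
import math


def contrast_se(coeffs, sizes, mse):
    """Standard error of a contrast: sqrt(MSE * sum(c_i^2 / r_i))."""
    assert abs(sum(coeffs)) < 1e-12, "co-efficients of a contrast must sum to zero"
    return math.sqrt(mse * sum(ci * ci / ri for ci, ri in zip(coeffs, sizes)))


k, mse, r = 6, 1.57, 4
f_crit = 2.90                                   # F at 5% for (5, 15) d.f., as quoted in Example 20.7

S = math.sqrt((k - 1) * f_crit)                 # step (i)
coeffs = [1, -1, 0, 0, 0, 0]                    # contrast between two treatment means
s_q = contrast_se(coeffs, [r] * k, mse)         # step (ii)
critical_value = S * s_q                        # step (iii)

print(round(S, 3), round(s_q, 3), round(critical_value, 2))
```

With these figures a difference between two treatment means would have to exceed about 3.37 to be declared significant, a stricter requirement than the LSD of 1.89, as is to be expected of a criterion that covers all possible contrasts.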


20.5 THE ANALYSIS OF VARIANCE MODELS 


A statistical model is a mathematical structure which fully describes
how the data could be generated. That is, a model describes what an
observation is composed of. In statistical models, it comprises a mean
and a random error component, where the mean may involve a single
parameter $\mu$ or a sum of parameters. The parameter $\mu$ is constant while
the random error components are assumed to be independently and
normally distributed with zero mean and a common variance.

An example of the widely used linear additive model in the ANOVA
is the expression

$$X_{ij} = \mu + \tau_j + \varepsilon_{ij},$$

where $X_{ij}$ is the ith observation for the jth sample or treatment, $\mu$ is the
overall average effect, $\tau_j$ is the treatment effect undergone by the jth
sample and $\varepsilon_{ij}$ are the error terms due to unspecified causes. To specify a
model completely, some assumptions are made about the treatments.

For example, when we assume that
(i) the treatment effects $\tau_j$'s are fixed and $\sum\tau_j = \sum(\mu_j-\mu) = 0$, the
model is called a fixed effects model or Model I;
(ii) the treatment effects $\tau_j$'s are a random sample from a population of
treatment effects, the model is called a random effects model or Model II.

The hypotheses under Model I are formulated as

$H_0$ : There are no differences among the effects of k treatments
present in the experiment, and

$H_1$ : Not all treatment effects are equal.

In other words, we may state
$H_0: \tau_j = 0$ (for all j) and $H_1$ : Some $\tau_j \ne 0$.

The hypotheses under Model II are formulated as

$H_0$ : There are no differences among the effects of all the
treatments in the population from which the k treatments
in the experiment are a random sample,
implying that …


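Least-Squares Estimates of Effects in One-Way Classification.
Let $X_{ij}$ represent the ith observation of the jth sample or
treatment. Then the model is

$$X_{ij} = \mu + \tau_j + \varepsilon_{ij},$$

where $\mu$ is the population general mean, $\tau_j = \mu_j - \mu$ is the additive effect
of the jth treatment and $\varepsilon_{ij}$ denote the random errors, which are
assumed to be independently and normally distributed with zero mean
and variance $\sigma^2$. Thus the observation $X_{ij}$ is composed of three
components which are added together. The null hypothesis that the
population means are equal, may now be replaced by the hypothesis

$$H_0: \tau_1 = \tau_2 = \ldots = \tau_k = 0.$$

The two independent estimates of $\sigma^2$, on which the statistic F is based,
may be determined by means of the least-squares principle. That is, we find
the estimates of $\mu$ and $\tau_j$ which will minimize the sum of squares
of the errors. Thus to minimize $\sum\sum\varepsilon_{ij}^2$ subject to the restriction that
$\sum\tau_j = 0$, we put

$$Q = \sum_i\sum_j \varepsilon_{ij}^2 + \lambda\sum_j\tau_j, \text{ where } \lambda \text{ is Lagrange's multiplier,}$$

$$\;\; = \sum_i\sum_j [X_{ij}-\mu-\tau_j]^2 + \lambda\sum_j\tau_j.$$

Differentiating Q w.r.t. $\mu$, $\tau_j$ and $\lambda$, and equating to zero, we get

$$\frac{\partial Q}{\partial\mu} = 0 = -2\sum_i\sum_j (X_{ij}-\mu-\tau_j) \qquad (1)$$

$$\frac{\partial Q}{\partial\tau_j} = 0 = -2\sum_i (X_{ij}-\mu-\tau_j) + \lambda \qquad (2)$$

$$\frac{\partial Q}{\partial\lambda} = 0 = \sum_j\tau_j \qquad (3)$$

From equation (1), we get $\hat{\mu} = \bar{X}_{..}$ ($\because \sum\tau_j = 0$).

Carrying out the summation over i's in equation (2), we get

$$\bar{X}_{.j} - \mu - \tau_j + \frac{\lambda}{2r} = 0.$$

To find $\lambda$, we sum over j's and have $\lambda = 0$.

Hence $\hat{\mu} = \bar{X}_{..}$, and $\hat{\tau}_j = \bar{X}_{.j} - \bar{X}_{..}$.

Now Treatment SS = $\sum_i\sum_j \hat{\tau}_j^2 = \sum_i\sum_j (\bar{X}_{.j}-\bar{X}_{..})^2$
$\qquad\qquad\qquad\;\; = r\sum_j (\bar{X}_{.j}-\bar{X}_{..})^2$,

Error SS = $\sum_i\sum_j (X_{ij}-\hat{\mu}-\hat{\tau}_j)^2$
$\qquad\quad\; = \sum_i\sum_j (X_{ij}-\bar{X}_{.j})^2$.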
These sums of squares are the same as before and are orthogonal as
shown by Cochran's theorem.


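Expected Mean Squares. From our model $X_{ij} = \mu + \tau_j + \varepsilon_{ij}$, we
have

$\bar{X}_{.j} = \mu + \tau_j + \bar{\varepsilon}_{.j}$ and $\bar{X}_{..} = \mu + \bar{\varepsilon}_{..}$, so that

$$\bar{X}_{.j} - \bar{X}_{..} = \tau_j + \bar{\varepsilon}_{.j} - \bar{\varepsilon}_{..}.$$

Taking expected values, we get

$$E[\text{Treatment SS}] = rE\Big[\sum_j (\tau_j + \bar{\varepsilon}_{.j} - \bar{\varepsilon}_{..})^2\Big]$$

$$\qquad = r\sum_j \tau_j^2 + rE\Big[\sum_j (\bar{\varepsilon}_{.j} - \bar{\varepsilon}_{..})^2\Big] + \text{the cross product term, which vanishes as } \tau_j \text{ is fixed}$$

$$\qquad = r\sum_j \tau_j^2 + r\,\frac{\sigma^2}{r}(k-1), \text{ so that}$$

$$E[\text{Treatment Mean Square}] = \frac{E[\text{Treatment SS}]}{k-1} = \sigma^2 + \frac{r}{k-1}\sum_j \tau_j^2, \text{ and}$$

$$\text{Error SS} = \sum_i\sum_j (X_{ij} - \bar{X}_{.j})^2,$$

$$E[\text{Error Mean Square}] = \sigma^2.$$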
These results c 


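$$X_{ij} = \mu + \beta_i + \tau_j + \varepsilon_{ij},$$

where $X_{ij}$ denotes the observation in the ith row and the jth column,
$\mu$ is the overall mean for all observations,
$\beta_i$ denotes the effect of the ith Row or Block,
$\tau_j$ denotes the effect of the jth Treatment or Column, and
$\varepsilon_{ij}$ are the error terms which are assumed to be independently and
normally distributed as N(0, $\sigma^2$).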


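The sum of squares to be minimized subject to $\sum\beta_i = 0$ and $\sum\tau_j = 0$ is

$$Q = \sum_i\sum_j (X_{ij} - \mu - \beta_i - \tau_j)^2.$$

Differentiating Q w.r.t. $\mu$ and setting equal to zero, we get

$$\frac{\partial Q}{\partial\mu} = 0 = -2\sum_i\sum_j (X_{ij} - \mu - \beta_i - \tau_j).$$

As $\sum\beta_i = 0$ and $\sum\tau_j = 0$, we get $\hat{\mu} = \bar{X}_{..}$.

Again $\dfrac{\partial Q}{\partial\beta_i} = 0 = -2\sum_j (X_{ij} - \mu - \beta_i - \tau_j)$, so that

$\hat{\mu} + \hat{\beta}_i = \bar{X}_{i.}$ or $\hat{\beta}_i = \bar{X}_{i.} - \bar{X}_{..}$.

Similarly, we get $\hat{\tau}_j = \bar{X}_{.j} - \bar{X}_{..}$.

Thus SSE = $\sum_i\sum_j (X_{ij} - \hat{\mu} - \hat{\beta}_i - \hat{\tau}_j)^2$
$\qquad\quad = \sum_i\sum_j (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})^2$, as before.

The null hypotheses that (i) the column means (or treatment
means) are equal, and (ii) the row means are equal, are replaced by

$H_0: \tau_1 = \tau_2 = \ldots = \tau_c = 0$,

$H_0': \beta_1 = \beta_2 = \ldots = \beta_r = 0$.

These results are subject to the assumptions of normality,
independence and of equal variance.

Block (Row) SS or SSB = $c\sum_i (\bar{X}_{i.} - \bar{X}_{..})^2$, so that

$$MSB = \frac{SSB}{r-1}.$$

Similarly, $MSE = \dfrac{SSE}{(r-1)(c-1)}$.

Hence, we reject $H_0$ if $F_1 = \dfrac{MSC}{MSE} \ge F_\alpha[(c-1), (c-1)(r-1)]$.

Similarly, we reject $H_0'$ if $F_2 = \dfrac{MSB}{MSE} \ge F_\alpha[(r-1), (c-1)(r-1)]$.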


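The corresponding model for the two-way analysis of variance with
interaction, is given by

$$X_{ijk} = \mu + \beta_i + \tau_j + (\beta\tau)_{ij} + \varepsilon_{ijk}, \quad i = 1, 2, \ldots, r;\; j = 1, 2, \ldots, c;\; k = 1, 2, \ldots, n,$$

where $(\beta\tau)_{ij}$ denotes the interaction effect of the ith row and jth column.
To find the LS estimates, the quantity $\sum\sum\sum \varepsilon_{ijk}^2$ is to be minimized
subject to the restrictions

$$\sum_i \beta_i = 0, \quad \sum_j \tau_j = 0, \quad \sum_i (\beta\tau)_{ij} = 0 \quad \text{and} \quad \sum_j (\beta\tau)_{ij} = 0.$$

The original null hypotheses are in this case replaced by the following
hypotheses:

$H_0: \beta_1 = \beta_2 = \ldots = \beta_r = 0$,

$H_0': \tau_1 = \tau_2 = \ldots = \tau_c = 0$, and

$H_0'': (\beta\tau)_{11} = (\beta\tau)_{12} = \ldots = (\beta\tau)_{rc} = 0$.

The F-test is carried out exactly in the same way.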


EXERCISES 


20.1 (a) Discuss why using multiple two-sample t-tests is not an
appropriate alternative to analysis of variance.

(b) What is meant by Analysis of Variance and degrees of
freedom? What are the assumptions underlying a one-way
analysis of variance? (P.U., B.A/B.Sc. 1996)

20.2 … differences. (P.U., B.A/B.Sc. 1985)

20.3 Describe what is meant by "partitioning the total sum of
squares". Partition the total sum of squares $\sum_{j=1}^{k}\sum_{i=1}^{r}(X_{ij}-\bar{X}_{..})^2$ into
the Error sum of squares and Treatment sum of squares. Find
the number of degrees of freedom associated with each of these.
How does the partition change if the sample sizes are unequal?

20.4 (a) Explain in detail why it is not a good statistical procedure to
perform several t-tests on pairs of means, when several
means are to be compared. Suggest the alternative and give
its assumptions.

(b) Derive the partitioning of the degrees of freedom for analysis
of variance in one-way classification.

(c) Describe the LSD test in analysis of variance.
(P.U., B.A./B.Sc. 1993)

20.5 (a) What is meant by a Two-Way Analysis of Variance and an
Interaction?

(b) Derive the partition of the total sum of squares for a two-way
analysis of variance without interaction. Find the number of
degrees of freedom associated with each of the sources of
variation.

20.6 Show that

(i) $\sum_{i=1}^{r}\sum_{j=1}^{c}(\bar{X}_{i.}-\bar{X}_{..})(X_{ij}-\bar{X}_{i.}-\bar{X}_{.j}+\bar{X}_{..}) = 0$,

(ii) $\sum_{i=1}^{r}\sum_{j=1}^{c}(\bar{X}_{.j}-\bar{X}_{..})(X_{ij}-\bar{X}_{i.}-\bar{X}_{.j}+\bar{X}_{..}) = 0$,

(iii) $\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n}(X_{ijk}-\bar{X}_{ij.})(\bar{X}_{i..}-\bar{X}_{...}) = 0$.

20.7 (a) Describe the method of Analysis of Variance for the one-way
classification and apply it to the following data:

Sample Number

(b) Use Bartlett's test to determine if the assumption of equal
variances is satisfied.

20.8 (a) Twenty men are used in an experiment, five being assigned
at random to each of the four machines. The observations are
the amounts produced by the machines in one day. Test the
hypothesis at $\alpha$ = 0.05, that the machines are not different
with respect to the number of items produced.

Machine Number

(b) Discuss the assumptions involved in the above analysis.

20.9 (a) The following are three consecutive weeks' earnings of three
salesmen employed by a given firm:

Salesmen

20.10 … Do an analysis of variance on these data and test the hypothesis
that the three tube types require the same average warm-up …

20.11 Consider the following 5 random samples:

Samples

Perform the analysis of variance and test the hypothesis at the
0.05 level of significance, that the samples come from populations
having the same means.

20.12 Describe a statistical procedure for comparing the means of k
groups of observations, possibly of different sizes. State any
assumptions that you make, and demonstrate the calculations
using the following observations:

Observations

10, 11, 17,
8, 10, 11, 12, 12, 15
13, 15, 20,

20.13 Determinations are made on the yield using three methods of
catalyzing a chemical process.

47.2, 49.8, 48.5, 48.7;
50.1, 49.8, 51.5, 50.9;
49.1, 53.2, 51.2, 52.8, 52.3.

Do the methods differ significantly at the 5% level of
significance? (P.U., B.A./B.Sc. 1922)


20.14 Three sections of the same elementary mathematics course are
taught by 3 teachers. The final grades were recorded as follows:

Teacher
A   75, 91, 84, 45, 82, 75, 68, 47, 95, 38.
B   59, 83, 99, 77, 65, 81, 34, 81, 77, 88, 94, 51, 82
C   66, 77, 51, 90, 73, 90, 71, 68, 69.

Test the hypothesis that the average grades given by the three
teachers are equal. Use a 0.05 level of significance.

20.15 Each of the sets of observations given below is a random sample
drawn from a normal population.

(a) Use Bartlett's test to check the equality of variances.

(b) Perform the analysis of variance to test for the equality of means.
State the hypotheses and assumptions.

20.16 (a) Given that the sample size, the arithmetic mean and the
variance, $s^2$, for each of k independent samples are known,
how do you compute the SSW and SSB to carry out an
analysis of variance for these k samples?

(b) Given the following information:

No. of values ($n_i$)              4     6     …
Sample means ($\bar{X}_i$)             58    57    43    42
Estimate of variance ($s_i^2$)     10    30.4  5.67  9

Construct an analysis of variance table and test the hypothesis
that the population means are equal at $\alpha$ = 0.05.
(P.U., B.A./B.Sc. 1996)

20.17 A random sample of 54 plants was selected from a field
containing one-year old plants of a variety of guayule, a plant
species yielding rubber. Of these plants, 27 were normal, 15
offtypes and 12 aberrants. The percentage rubber content was
determined from each plant, and the data are given as follows:

Percentage Rubber Content

(a) Make an analysis of variance to test whether the data are
consistent with the hypothesis that there is no difference in
mean rubber content for the three types of plants. Use $\alpha$ = …

(b) If the means of the distributions of each type of plant are denoted
by $\mu_1$, $\mu_2$ and $\mu_3$ respectively, use Student's t-statistic to test the
following particular hypotheses: (i) $\mu_1 = \mu_3$, (ii) $\mu_2 = 0.5(\mu_1 + \mu_3)$.
(P.U., B.A/B.Sc. 1975, 89)

i f 

20.18 Test the hypothesis that the mean scores of
students in arithmetic computation in the different types of
schools (Grade 4) are equal, using 0.05 level of significance.


aam ee e ee „3 


Frequency Da 
Residential Non-residential „„- „Mission 
; i 7 4 
25 & 
37 17 
13 29 
4 7 


2 m pe n of (0) 1 ls the followin; results 
mals, g 
0.19 In a feeding experime t some an . 


- were obtained, the numbers in the tal ier 
weight in pounds. The animals were in group 


Rations B 
c 8.5 


ons, at @= 


. ifference in rati 
Test the hypothesis of no differe (P.U. B.A 


. 342 


20.20 Construct the analysis of vari 


20.21 


(a) 


(b) 
20.22 


INTRODUCTION TO STATISTICA 
LTI j 

ance table for the followin d 

_ Factor B . i. 


1 
2 
Factor A 3 
4 
5 37 
Test the hypotheses that 


i 
fe; Factor A has no influence on yields, and 
i) Factor B has no influence on Hates. 


S 1 
cat B Were fed on thre 
Four breed of tle B ) Bo, B3, 4 e different 


rations R., R 

l R}. Gains i eg 

were rec A 2 ita. Gains in weight in pound : 
orded as given below: nds over a given period 


different breeds of cattle, 


Ive plots of d 
, an A i groun 

d each variety is treated with five 
as follows: 


lysis of yay A 
_vetlance is applied to find out whether 


of varieties ind ifference exists h os ge 
ependently of the fertilizer and le i 
: ifferentia 


effect is exerted by the fertilizer independently of the variety. 
(P.U., B.A/B.Sc. 1963; D.St., 1960; C.S.S. 1961) 


20.23 An experiment is conducted in which 4 treatments are to be 
compared using five subjects. The following data are generated: 


Subject i 


Treatment 


9.6 17.1 


Perform the analysis of variance, separating out the treatment, subject and error sums of squares. Use a 0.05 level of significance to test the hypothesis that there is no difference between the treatment means. (P.U., B.A./B.Sc. 1984)


20.24 A certain company had four salesmen A, B, C and D, each of whom was sent for a week into three types of area, country area K, outskirts of a city O and shopping centre of a city S. The sales in pounds per week are shown below:
Salesman 


K 
District O 


Carry out analysis of variance and interpret the results stating 
the assumptions under which your results are valid. 
20.25 The following data represent the marks obtained by five students 


in three subjects: 


Subjects 
Statistics Economics 


Use a 0.05 level of significance to test the hypotheses that

(a) The courses are of equal difficulty.
(b) The students have equal ability. (P.U., B.A./B.Sc. 1987)


20.26 Carry out an appropriate analysis of variance on the following data:

Use a 0.05 level of significance to test the hypotheses that
(a) the column-means are equal;
(b) the row-means are equal.

20.27

20.28 Discuss each of the following:
(a) The Fisher's Least Significant Difference Test,
(b) The Student-Newman-Keul's Multiple Range Test,
(c) Duncan's Multiple Range Test,
(d) The Scheffe's Method.

20.29 Use Duncan's multiple range test, with a 0.05 level of significance, to test all pairs of sample means in Exercise 20.11.

20.30 Each of five varieties of corn is planted in three plots in a large field. The respective yields, in bushels per acre, are indicated below:

Var 1 Var 2 Var 3 Var 4 Var 5
46.2
51.9
48.7

(a) Test whether the differences among the average yields are statistically significant. Let $\alpha = 0.05$.
(b) Use Duncan's multiple range test to make comparisons between pairs of means.

20.31 (a) What is the least significant difference and how is it used in interpreting the result of an experiment?

(b) The effects of four types of graphite coaters on light box readings are to be studied. As these readings might differ from day to day, observations are to be taken on each of the four types every day for three days. The results are:

Graphite Coater Type

5.0
5.2
5.6

Construct an analysis of variance table. Determine at the 0.05 level whether there is a significant difference between Coater Types. Apply the LSD test to locate significant differences between the various pairs of Coater Types.

20.32 Discuss the difference between fixed effects and random effects models in the analysis of variance. Write down an analysis of variance table and the expected mean squares for a one-way classification, assuming fixed effects.

20.33 (a) Explain what is meant by the terms fixed-effects model, random effects model and mixed model in the analysis of variance.
(b) Minimize the following funstion: 


346 


20.35 


with the help of partial derivatives. 


Evaluate the expression for $\mu$, $\alpha_i$ and $\beta_j$ with the condition that
$$\sum_{i=1}^{r} \alpha_i = \sum_{j=1}^{c} \beta_j = 0.$$
(P.U., B.A./B.Sc. 1963)


(a) that the T treatments 


constitute the only treatments of 
interest; i 


(b) that the T treatments are a random sample from a large population of treatments.


(P.U., B.A./B.Sc. Hons, 1972) 
A highway research engineer wishes 
four types of subgrade soil on the moistur 
soil. He takes five samples of each ty 
total sum of Squares is computed a 
Squares among the four types of suber, 


(i) Set up an analysis of variance table for these data. 


(iv) Set up a set of orthogonal co 
(v) Explain briefly how to set u 




Statistical Inference in 
Regression and Correlation 


21.1 INTRODUCTION 


In Chapters 10 and 11, we have introduced the concepts and techniques for linear regression models and correlation. In this chapter we consider the inferential procedures associated with linear regressions and correlation coefficients.


A simple linear regression model that describes the relationship 
between X and Y takes the form 


$$Y_i = \alpha + \beta X_i + \varepsilon_i \quad \text{or} \quad Y_i = \mu_{Y.X} + \varepsilon_i,$$
where the $\varepsilon_i$'s are random errors. The random errors $\varepsilon_i$ are assumed to be independent of $X_i$ and normally distributed with $E(\varepsilon_i) = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma_{Y.X}^2$, a constant for all $X$. These assumptions imply that the $Y_i$ also have the common variance $\sigma_{Y.X}^2$, as the only random element in the model is $\varepsilon_i$.


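The regression line $E(Y_i) = \mu_{Y.X} = \alpha + \beta X_i$ is estimated from the sample data by $\hat{Y} = a + bX$, where
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}, \qquad a = \bar{Y} - b\bar{X},$$
and $\hat{Y}$ is the sample estimate of the population mean $\mu_{Y.X}$.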


It is evident that in a linear regression, the statistics $a$, $b$ and $\hat{Y}$ will vary from one sample of data to another. They are therefore random variables and hence have sampling distributions. For the purposes of statistical inference, we must know the means, variances and the shapes of these sampling distributions.

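We derive the means and variances as below:

(i) Mean and Variance of the Sampling Distribution of b. The mean of the sampling distribution of $b$ is $\mu_b = \beta$ and the variance is $\sigma_b^2 = \sigma_{Y.X}^2 / \sum (X_i - \bar{X})^2$.

In Example 15.6 on page 75, we found that
$$\mu_b = E(b) = \beta.$$
Now
$$\sigma_b^2 = \operatorname{Var}(b) = \operatorname{Var}\left[\frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}\right] = \operatorname{Var}\left[\frac{\sum (X_i - \bar{X}) Y_i}{\sum (X_i - \bar{X})^2}\right].$$
Since terms involving $X_i$ are constant in a regression model, therefore
$$\sigma_b^2 = \frac{1}{\left[\sum (X_i - \bar{X})^2\right]^2} \sum \left[(X_i - \bar{X})^2 \operatorname{Var}(Y_i)\right] = \frac{\sigma_{Y.X}^2 \sum (X_i - \bar{X})^2}{\left[\sum (X_i - \bar{X})^2\right]^2} = \frac{\sigma_{Y.X}^2}{\sum (X_i - \bar{X})^2}.$$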
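The result $\sigma_b^2 = \sigma_{Y.X}^2 / \sum (X_i - \bar{X})^2$ can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation: the design points, the parameter values ($\alpha = 8$, $\beta = 16$, $\sigma_{Y.X} = 2$), the number of replications and the function name `simulate_slopes` are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_slopes(x, alpha, beta, sigma, reps, seed=0):
    """Fit the slope b on `reps` simulated samples of Y = alpha + beta*X + error."""
    rng = random.Random(seed)
    x_bar = sum(x) / len(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slopes = []
    for _ in range(reps):
        # The X values stay fixed; only the normal errors change between samples.
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        # b = sum (X_i - Xbar) Y_i / sum (X_i - Xbar)^2
        slopes.append(sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx)
    return slopes, sxx


# Hypothetical design points and parameters, chosen only for illustration.
x_values = [0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4]
slopes, sxx = simulate_slopes(x_values, alpha=8.0, beta=16.0, sigma=2.0, reps=20000)

empirical_mean = statistics.fmean(slopes)
empirical_var = statistics.pvariance(slopes)
theoretical_var = 2.0 ** 2 / sxx  # sigma^2 / sum (X_i - Xbar)^2

print(f"mean of b: {empirical_mean:.3f} (beta = 16)")
print(f"variance of b: {empirical_var:.3f} (theory {theoretical_var:.3f})")
```

With a large number of replications the empirical mean and variance of the simulated slopes should settle close to $\beta$ and to the theoretical variance respectively.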


Since the random errors $\varepsilon_i$ in the regression model are assumed to be normally distributed, therefore the distribution of $b$ is normal with a mean of $\mu_b = \beta$ and a standard deviation (or standard error of $b$) of
$$\sigma_b = \frac{\sigma_{Y.X}}{\sqrt{\sum (X_i - \bar{X})^2}}.$$
Generally, $\sigma_{Y.X}^2$ will be unknown; we therefore require an estimate of $\sigma_{Y.X}^2$ from the sample data. The unbiased estimator is given by
$$s_{Y.X}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}.$$
Thus the estimate of $\sigma_b^2$, denoted by $s_b^2$, may be taken as
$$s_b^2 = \frac{s_{Y.X}^2}{\sum (X_i - \bar{X})^2}.$$

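(ii) Mean and Variance of the Sampling Distribution of a. The mean of the sampling distribution of $a$ is $\mu_a = \alpha$ and the variance is
$$\sigma_a^2 = \sigma_{Y.X}^2 \left[\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right].$$
In the estimated regression line $\hat{Y} = a + bX$, $a$ is given as $a = \bar{Y} - b\bar{X}$.

Therefore
$$\mu_a = E(a) = E(\bar{Y} - b\bar{X}) = E\left[\frac{\sum Y_i}{n} - b\bar{X}\right] = \frac{1}{n}\sum E(Y_i) - \bar{X}E(b).$$
But $E(Y_i) = \alpha + \beta X_i$ and $E(b) = \beta$.
$$\therefore \mu_a = \frac{\sum (\alpha + \beta X_i)}{n} - \beta\bar{X} = \alpha + \beta\bar{X} - \beta\bar{X} = \alpha.$$
And
$$\sigma_a^2 = \operatorname{Var}(a) = \operatorname{Var}\left[\frac{\sum Y_i}{n} - b\bar{X}\right] = \operatorname{Var}\left[\frac{\sum Y_i}{n}\right] + \operatorname{Var}\left[b\bar{X}\right] \quad (\because b \text{ and } \bar{Y} \text{ are independent})$$
$$= \frac{1}{n^2}\sum \operatorname{Var}(Y_i) + \bar{X}^2 \operatorname{Var}(b) = \frac{\sigma_{Y.X}^2}{n} + \frac{\bar{X}^2 \sigma_{Y.X}^2}{\sum (X_i - \bar{X})^2} = \sigma_{Y.X}^2 \left[\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right].$$
When $\sigma_{Y.X}^2$ is unknown, we use $s_{Y.X}^2$ in place of $\sigma_{Y.X}^2$.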


The assumption that the random errors $\varepsilon_i$ are normally distributed leads to the fact that the distribution of $a$ is normal with mean $\mu_a = \alpha$ and standard deviation
$$\sigma_a = \sigma_{Y.X} \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}}.$$

(iii) Mean and Variance of the Sampling Distribution of $\hat{Y}$. We can also find the mean and variance of $\hat{Y}$ when it is used as an estimate of the mean $\mu_{Y.X}$. Let $\hat{Y}$ represent the estimated value of $Y_i$ in a linear regression, for a given value $X_0$. Then we have
$$\hat{Y} = a + bX_0$$


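where $X_0$ is the value of $X$ on which the estimate is based.

Now
$$\mu_{\hat{Y}} = E(\hat{Y}) = E(a + bX_0) = \alpha + \beta X_0 \quad (\because X_0 \text{ is constant w.r.t. expectation})$$
and
$$\operatorname{Var}(\hat{Y}) = \operatorname{Var}(a + bX_0) = \operatorname{Var}(\bar{Y} + bX_0 - b\bar{X}) \quad (\because a = \bar{Y} - b\bar{X})$$
$$= \operatorname{Var}(\bar{Y}) + (X_0 - \bar{X})^2 \operatorname{Var}(b) \quad (\because \bar{Y} \text{ and } b \text{ are independent})$$
$$= \sigma_{Y.X}^2 \left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}\right].$$
We use $s_{Y.X}^2$ in place of $\sigma_{Y.X}^2$ when $\sigma_{Y.X}^2$ is not known.

The assumption that the $\varepsilon_i$'s are normally distributed leads to the distributional fact that the distribution of $\hat{Y}$ is normal with mean $\mu_{\hat{Y}} = \alpha + \beta X_0$ and standard deviation
$$\sigma_{\hat{Y}} = \sigma_{Y.X} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}.$$
With the information on the means and standard deviations of these statistics, we may construct confidence intervals of, or test hypotheses about, various unknown parameters in a regression model.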


21.2 INTERVAL ESTIMATION IN THE SIMPLE 
LINEAR REGRESSION 


21.2.1. Confidence Interval Estimate of Population Regression Co-efficient $\beta$. To construct a confidence interval for $\beta$, the population regression co-efficient, we use $b$, the sample estimate of $\beta$. The sampling distribution of $b$ is normal with a mean $\beta$ and a standard deviation $\sigma_{Y.X} / \sqrt{\sum (X - \bar{X})^2}$. That is, the variable
$$Z = \frac{b - \beta}{\sigma_{Y.X} / \sqrt{\sum (X - \bar{X})^2}}$$
is a standard normal variable.


But $\sigma_{Y.X}$ is generally not known; we therefore estimate it from the sample data by
$$s_{Y.X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}.$$
We shall then use the Student's t-distribution rather than the normal distribution. In other words, the statistic
$$t = \frac{b - \beta}{s_{Y.X} / \sqrt{\sum (X - \bar{X})^2}} = \frac{b - \beta}{s_b}, \quad \text{where } s_b = \frac{s_{Y.X}}{\sqrt{\sum (X - \bar{X})^2}},$$
which follows a Student's t-distribution with $\nu = n - 2$ degrees of freedom, is used.

Consequently, we make the following probability statement
$$P\left[-t_{\alpha/2,(\nu)} < \frac{b - \beta}{s_b} < t_{\alpha/2,(\nu)}\right] = 1 - \alpha,$$
where $t_{\alpha/2,(\nu)}$ is the value of the t-distribution with $\nu$ degrees of freedom, leaving an area equal to $\alpha/2$ to the right. The probability statement after simplification becomes
$$P\left[b - t_{\alpha/2,(\nu)}\, s_b < \beta < b + t_{\alpha/2,(\nu)}\, s_b\right] = 1 - \alpha.$$
Hence a $100(1 - \alpha)$ percent confidence interval for $\beta$, the population regression co-efficient, for a particular sample of size $n$ ($n < 30$) is given by
$$b \pm t_{\alpha/2,(n-2)}\, s_b,$$
where
$$s_b^2 = \frac{s_{Y.X}^2}{\sum (X - \bar{X})^2} = \frac{\sum (Y - \hat{Y})^2}{(n - 2) \sum (X - \bar{X})^2}.$$
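The interval $b \pm t_{\alpha/2,(n-2)}\, s_b$ can be sketched in Python as follows. This is an illustrative helper rather than a prescribed method: the function name `slope_confidence_interval` is hypothetical, and the critical value is passed in from t-tables because the standard library has no t-quantile function. The figures used are those of Example 21.1.

```python
import math


def slope_confidence_interval(b, sse, sxx, n, t_crit):
    """Return (lower, upper) for beta using b +/- t * s_b.

    sse is sum (Y - Yhat)^2 and sxx is sum (X - Xbar)^2, so that
    s_b^2 = sse / ((n - 2) * sxx).
    """
    s_b = math.sqrt(sse / ((n - 2) * sxx))
    margin = t_crit * s_b
    return b - margin, b + margin


# Figures from Example 21.1; 2.074 is t(0.025, 22) read from tables.
lower, upper = slope_confidence_interval(b=-0.1608, sse=1160, sxx=1300, n=24, t_crit=2.074)
print(round(lower, 3), round(upper, 3))
```

Because the sketch carries $s_b$ without rounding it to 0.20, its limits differ slightly in the third decimal place from a hand calculation that rounds at each step.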

Example 21.1. In a linear regression problem, the following results are obtained:
$\hat{Y} = 12.13 - 0.1608X$, $\sum (Y_i - \hat{Y})^2 = 1160$, $\sum (X_i - \bar{X})^2 = 1300$, and $n$, the number of pairs of the values of X and Y, $= 24$.

Assuming normality, compute a 95% confidence interval for $\beta$, the population regression co-efficient.

The 95% confidence interval for $\beta$ is given by
$$b \pm t_{\alpha/2,(n-2)}\, s_b.$$
Here $b = -0.1608$, $t_{0.025,(24-2)} = 2.074$ (from t-tables), and
$$s_b^2 = \frac{s_{Y.X}^2}{\sum (X - \bar{X})^2} = \frac{\sum (Y - \hat{Y})^2}{(n - 2) \sum (X - \bar{X})^2} = \frac{1160}{(22)(1300)} = 0.0406, \text{ so that}$$


$s_b = \sqrt{0.0406} = 0.20$.

Substituting these values, we get
$-0.1608 \pm (2.074)(0.20)$
or $-0.1608 \pm 0.4148$ or $-0.5756$ to $0.2540$.

Hence the desired 95% confidence interval for the population regression co-efficient $\beta$ is $(-0.576, 0.254)$.

21.2.2. Confidence Interval Estimate of $\alpha$, the Intercept of Regression Line. To construct a confidence interval for $\alpha$ we use $a$, the sample estimate of $\alpha$. We have already observed that $a$ is distributed normally with mean $\mu_a = \alpha$ and standard deviation
$$\sigma_a = \sigma_{Y.X} \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}}.$$
Since $\sigma_{Y.X}$ is usually unknown, we use its sample estimate $s_{Y.X}$.

It can be shown, for the normal regression model, that the statistic
$$t = \frac{a - \alpha}{s_a}, \quad \text{where } s_a = s_{Y.X} \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}},$$
has a Student's t-distribution with $\nu = n - 2$ degrees of freedom. Using this statistic, a confidence interval for $\alpha$ can be constructed in the same way as the confidence interval for $\beta$, with $a$ replacing $b$ and $s_a$ replacing $s_b$.

Hence a $(1 - \alpha)100$ per cent confidence interval for $\alpha$ is given by
$$a \pm t_{\alpha/2,(n-2)}\, s_a.$$

21.2.3. Confidence Interval Estimate of Mean Value $\mu_{Y.X}$ for a Given Value $X_0$. Perhaps the major goal of a regression study is to use the estimated regression equation $\hat{Y} = a + bX$ to estimate the mean $\mu_{Y.X}$ of the Y values, when $X = X_0$. We have observed that the sampling distribution of $\hat{Y}$ is normal with $\mu_{\hat{Y}} = \mu_{Y.X}$ $(= \alpha + \beta X_0)$ and standard deviation
$$\sigma_{\hat{Y}} = \sigma_{Y.X} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}.$$
Since $\sigma_{Y.X}$ is usually unknown, we therefore find the sample estimate of $\sigma_{\hat{Y}}$, denoted by $s_{\hat{Y}}$, as
$$s_{\hat{Y}} = s_{Y.X} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}.$$
It can be shown that the statistic
$$t = \frac{\hat{Y} - \mu_{Y.X}}{s_{\hat{Y}}}$$
has a Student's t-distribution with $\nu = n - 2$ degrees of freedom. Using the t-statistic, we may construct a confidence interval for $\mu_{Y.X}$ for a given value $X_0$, in the usual manner.

Hence a $(1 - \alpha)100$ per cent confidence interval for the mean value $\mu_{Y.X}$, when $X = X_0$, is given by
$$\hat{Y}_0 \pm t_{\alpha/2,(n-2)}\, s_{\hat{Y}},$$
where $\hat{Y}_0 = a + bX_0$.

Example 21.2. Given the data

Fertilizer (X)    0.3  0.6  0.9  1.2  1.5  1.8  2.1  2.4
Corn Yield (Y)    10   15   30   35   25   30   50   45

Assuming normality, calculate the 95% confidence intervals for (i) the value of $\alpha$, (ii) the value of $\beta$, (iii) the true value of Y when $X = 3.0$.

To find sample estimates of various parameters in the regression model, the necessary computations are given below:
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{(8)(385.5) - (10.8)(240)}{(8)(18.36) - (10.8)^2} = \frac{492}{30.24} = 16.27,$$
$$a = \bar{Y} - b\bar{X} = 30 - (16.27)(1.35) = 8.04.$$
The estimated regression line is $\hat{Y} = 8.04 + 16.27X$.





It has been shown that $\hat{Y}_0$, used as a predictor of an individual value $Y_0$, follows the normal distribution with mean $\alpha + \beta X_0$, and that the variance of the prediction error $Y_0 - \hat{Y}_0$ is
$$\sigma_{Y.X}^2 \left[1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}\right].$$
Since $\sigma_{Y.X}^2$ is usually unknown, it is estimated by $s_{Y.X}^2$ from sample data. It has been further shown that the statistic
$$t = \frac{Y_0 - \hat{Y}_0}{s_{Y.X} \sqrt{1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}}$$
follows a Student's t-distribution with $\nu = n - 2$ d.f. For the purposes of inference about $Y_0$, we therefore use the t-distribution.

To consider the construction of a confidence interval about $Y_0$, we see that the quantity to be predicted is a value taken by a random variable, not a parameter of a distribution. The appropriate interval is thus not strictly a confidence interval; it is usually referred to as the prediction interval. A prediction interval is one that contains $Y_0$ with a desired probability. Hence a $(1 - \alpha)100$ per cent prediction interval for an individual value $Y_0$ that corresponds to a given $X_0$ is given by
$$\hat{Y}_0 \pm t_{\alpha/2,(n-2)}\, s_{Y.X} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}.$$
A prediction interval for an individual value of Y is always wider than the confidence interval for a mean value of Y.


Example 21.3. Using the data of Example 21.2, construct a 95% prediction interval for Y when $X = 3.0$.

In Example 21.2, we found the estimated regression line as
$$\hat{Y} = 8.04 + 16.27X.$$
When $X = 3.0$, $\hat{Y}_0 = 8.04 + 16.27(3.0) = 56.85$.
Further, $n = 8$, $\bar{X} = 1.35$, $X_0 = 3.0$, $s_{Y.X} = 7.05$ and $t_{0.025,(6)} = 2.447$.

Now, we find the 95% prediction interval for a Y-value when $X = 3.0$:
$$56.85 \pm (2.447)(7.05) \sqrt{1 + \frac{1}{8} + \frac{(3.0 - 1.35)^2}{3.78}}$$
or $56.85 \pm (2.447)(7.05)\sqrt{1 + 0.125 + 0.7202}$
or $56.85 \pm (2.447)(7.05)(1.358)$
or $56.85 \pm 23.43$ or $33.42$ to $80.28$.

Hence the desired 95% prediction interval for an individual Y value at $X = 3.0$ is $(33.42, 80.28)$.
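The arithmetic of Example 21.3 can be reproduced with a few lines of Python, which also makes the comparison with the confidence interval for the mean value explicit. This is a sketch only: the function name `interval_half_width` and its `individual` flag are hypothetical, and the inputs are the rounded values quoted in the example.

```python
import math


def interval_half_width(s_yx, n, x0, x_bar, sxx, t_crit, individual):
    """Half-width of the interval for Y at X = x0.

    individual=True adds the extra 1 under the root (prediction interval);
    individual=False gives the confidence interval for the mean value.
    """
    extra = 1.0 if individual else 0.0
    return t_crit * s_yx * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / sxx)


# Rounded values quoted in Example 21.3.
y0_hat = 8.04 + 16.27 * 3.0
common = dict(s_yx=7.05, n=8, x0=3.0, x_bar=1.35, sxx=3.78, t_crit=2.447)

pred_half = interval_half_width(individual=True, **common)
mean_half = interval_half_width(individual=False, **common)

prediction_interval = (y0_hat - pred_half, y0_hat + pred_half)
mean_interval = (y0_hat - mean_half, y0_hat + mean_half)

print("prediction interval:", [round(v, 2) for v in prediction_interval])
print("confidence interval for the mean:", [round(v, 2) for v in mean_interval])
```

The prediction half-width always exceeds the mean-value half-width because of the additional 1 under the square root.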


21.3 HYPOTHESIS TESTING IN THE REGRESSION MODEL

In this section, we present several procedures for testing hypotheses about unknown parameters in the linear regression and also about the linearity of regression.


21.3.1. Testing Hypotheses about $\beta$, the Population Regression Co-efficient. Suppose that we wish to test the hypothesis that the population regression co-efficient $\beta$ has some specified value $\beta_0$. We draw a random sample $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ of $n$ observations from a bivariate normal population and find the estimated regression equation $\hat{Y} = a + bX$, where $b$, the sample estimate of $\beta$, is normally distributed with mean $\beta$ and standard deviation $\sigma_{Y.X} / \sqrt{\sum (X - \bar{X})^2}$.

It has been shown earlier that, when $\sigma_{Y.X}$ is estimated from sample data, the statistic
$$t = \frac{b - \beta}{s_b}, \quad \text{where } s_b = \frac{s_{Y.X}}{\sqrt{\sum (X - \bar{X})^2}},$$
has a Student's t-distribution with $\nu = n - 2$ degrees of freedom. This statistic is therefore used to test the hypothesis $H_0: \beta = \beta_0$ and the procedure is given below:


(i) Formulate the null and alternative hypotheses about $\beta$. Three possible forms are
    $H_0: \beta = \beta_0$ and $H_1: \beta \ne \beta_0$,
    $H_0: \beta \le \beta_0$ and $H_1: \beta > \beta_0$,
    $H_0: \beta \ge \beta_0$ and $H_1: \beta < \beta_0$.
(ii) Decide on the significance level $\alpha$.
(iii) The test-statistic to use is
$$t = \frac{b - \beta_0}{s_b},$$
which, if $H_0$ is true, has a t-distribution with $\nu = n - 2$ degrees of freedom.
(iv) The critical region is
    $|t| \ge t_{\alpha/2,(n-2)}$ when $H_1$ is $\beta \ne \beta_0$,
    $t \ge t_{\alpha,(n-2)}$ when $H_1$ is $\beta > \beta_0$,
    $t \le -t_{\alpha,(n-2)}$ when $H_1$ is $\beta < \beta_0$.
(v) Compute the regression equation $\hat{Y} = a + bX$, $s_{Y.X}$, $s_b$ and $t = (b - \beta_0)/s_b$ from the sample data.
(vi) Decide as below:
    Reject $H_0$ if $t$ falls in the critical region, accept $H_0$ otherwise.

Most frequently, we are interested in testing the hypothesis $H_0: \beta = 0$ against $H_1: \beta \ne 0$. It is important to note that testing the hypothesis that $\beta = 0$ is equivalent to testing the hypothesis that the variable Y is independent of the variable X (in a linear sense). The test-statistic then becomes $t = b/s_b$. If we reject $H_0: \beta = 0$, we conclude that the two variables are linearly related. If we accept $H_0: \beta = 0$, we conclude that the two variables are not linearly related.


Example 21.4. In a linear regression problem, the following sums were computed from a random sample of size 10.

$\sum X = 320$, $\sum Y = 250$, $\sum X^2 = 12400$, $\sum XY = 9415$ and $\sum Y^2 = 7230$.

Using 5 per cent significance level, test the hypothesis that the population regression co-efficient $\beta$ is greater than 0.5.

(i) We state our null and alternative hypotheses as
    $H_0: \beta \le 0.5$, and $H_1: \beta > 0.5$.
(ii) The significance level is set at $\alpha = 0.05$.
(iii) The test-statistic, under $H_0$, is
$$t = \frac{b - \beta_0}{s_b} = \frac{b - 0.5}{s_b},$$
which has a Student's t-distribution with $\nu = 10 - 2$, i.e. 8 degrees of freedom.
(iv) Computations. Now
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{(10)(9415) - (320)(250)}{(10)(12400) - (320)^2} = \frac{14150}{21600} = 0.655;$$
$$a = \bar{Y} - b\bar{X} = \frac{250}{10} - (0.655)\left(\frac{320}{10}\right) = 4.04;$$
$$s_{Y.X}^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2} = \frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2} = \frac{7230 - (4.04)(250) - (0.655)(9415)}{10 - 2} = 6.647,$$
so that $s_{Y.X} = \sqrt{6.647} = 2.578$,
$$\sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} = 12400 - \frac{(320)^2}{10} = 2160,$$
and
$$s_b = \frac{s_{Y.X}}{\sqrt{\sum (X - \bar{X})^2}} = \frac{2.578}{\sqrt{2160}} = \frac{2.578}{46.476} = 0.055.$$
$$\therefore t = \frac{0.655 - 0.5}{0.055} = 2.82.$$
(v) The critical region is $t \ge t_{0.05,(8)} = 1.86$.
(vi) Conclusion. Since the calculated value of $t = 2.82$ falls in the critical region, we reject $H_0$. We may conclude that there is sufficient evidence to indicate that the population regression co-efficient is greater than 0.5.
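The six-step procedure for testing $H_0: \beta = \beta_0$ can be sketched in Python directly from the sums used in Example 21.4. The function name `slope_t_statistic` is a hypothetical helper, and the critical value 1.86 is taken from t-tables. Since the sketch does not round $a$, $b$ and $s_b$ at each step, it gives $t$ close to 2.80 rather than the hand-rounded 2.82; the decision is unchanged.

```python
import math


def slope_t_statistic(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2, beta0=0.0):
    """Return (b, s_b, t) for testing H0: beta = beta0 from raw sums."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    # sum (Y - Yhat)^2 = sum Y^2 - a sum Y - b sum XY
    sse = sum_y2 - a * sum_y - b * sum_xy
    s_yx = math.sqrt(sse / (n - 2))
    sxx = sum_x2 - sum_x ** 2 / n
    s_b = s_yx / math.sqrt(sxx)
    return b, s_b, (b - beta0) / s_b


# Sums from Example 21.4; H0: beta <= 0.5 against H1: beta > 0.5.
b, s_b, t_stat = slope_t_statistic(
    n=10, sum_x=320, sum_y=250, sum_x2=12400, sum_xy=9415, sum_y2=7230, beta0=0.5
)
t_crit = 1.86  # t(0.05, 8), read from t-tables
reject_h0 = t_stat >= t_crit

print(f"b = {b:.3f}, s_b = {s_b:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```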


Example 21.5. Estimate a regression line from the following data of height (X) and weight (Y) of 12 persons:

Height (X)    Weight (Y)
60            110, 135, 120
62            120, 140, 130, 135
64            150, 145
70            170, 185, 160

Test the hypothesis that the population regression coefficient $\beta$ is zero, i.e., height and weight are independent. Use a 0.05 level of significance.

(i) We set up our hypotheses as
    $H_0: \beta = 0$, i.e., the two variables X and Y are not related,
    $H_1: \beta \ne 0$, i.e., the two variables are related.
(ii) The significance level is set at $\alpha = 0.05$.
(iii) The test-statistic, if $H_0$ is true, is
$$t = \frac{b}{s_b}, \quad \text{where } s_b = \frac{s_{Y.X}}{\sqrt{\sum (X - \bar{X})^2}} \text{ and } s_{Y.X}^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2}.$$
Assuming that the distribution of $Y_i$ for each $X_i$ is normal with the same mean and the same standard deviation, the statistic conforms to the Student's t-distribution with $n - 2 = 10$ degrees of freedom.
(iv) Computations. The necessary computations are given below:
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{(12)(109380) - (766)(1700)}{(12)(49068) - (766)^2} = \frac{10360}{2060} = 5.03, \text{ and}$$
$$a = \bar{Y} - b\bar{X} = \frac{1700}{12} - (5.03)\left(\frac{766}{12}\right) = 141.6667 - 321.0816 = -179.41.$$
$$\therefore \hat{Y} = -179.41 + 5.03X.$$
Now
$$\sum (Y - \hat{Y})^2 = \sum Y^2 - a\sum Y - b\sum XY = 246100 + (179.41)(1700) - (5.03)(109380)$$
$$= 246100 + 304997 - 550181.40 = 915.60,$$
$$\sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} = 49068 - \frac{(766)^2}{12} = 171.67,$$
$$s_{Y.X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{915.60}{10}} = 9.5687, \text{ and}$$
$$s_b = \frac{s_{Y.X}}{\sqrt{\sum (X - \bar{X})^2}} = \frac{9.5687}{\sqrt{171.67}} = \frac{9.57}{13.10} = 0.73.$$
$$\therefore t = \frac{b}{s_b} = \frac{5.03}{0.73} = 6.89.$$
(v) The critical region is $|t| \ge t_{0.025,(10)} = 2.23$.
(vi) Conclusion. Since the computed value of $t = 6.89$ falls in the critical region, we reject the null hypothesis and may conclude that there is sufficient reason to say at the 5% level of significance that heights and weights are related.


21.3.2. Testing Hypotheses about $\alpha$, the Intercept of Population Regression. Suppose that we wish to test the hypothesis that $\alpha$, the population intercept, has some specified value $\alpha_0$. The sample estimate of $\alpha$ is $a$, which has been shown to be normally distributed with a mean of $\alpha$ and a standard deviation of
$$\sigma_{Y.X} \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2}}.$$
It has already been observed that, when $\sigma_{Y.X}$ is estimated from sample data, the statistic
$$t = \frac{a - \alpha}{s_{Y.X} \sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum (X - \bar{X})^2}}}$$
has a Student's t-distribution with $\nu = n - 2$ degrees of freedom. Hence we use the t-distribution to test the null hypothesis $H_0: \alpha = \alpha_0$ against an appropriate alternative hypothesis. The rest of the procedure is the same.


21.3.3. Testing Hypothesis about Mean Value $\mu_{Y.X}$ for a Given Value $X_0$. Quite frequently, we are interested in testing a hypothesis that the mean of a population of $Y_i$'s, when $X = X_0$, has some specified value $\mu_0$. That is, we wish to test $H_0: \mu_{Y.X} = \mu_0$ against a suitable alternative hypothesis. We obtain an unbiased estimate of $\mu_{Y.X}$ for $X = X_0$ from its sample counterpart $\hat{Y} = a + bX_0$. It can be shown that if $H_0$ is true, the statistic
$$t = \frac{(a + bX_0) - \mu_0}{s_{\hat{Y}}}, \quad \text{where } s_{\hat{Y}}^2 = s_{Y.X}^2 \left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}\right],$$
has a Student's t-distribution with $\nu = n - 2$ degrees of freedom. Consequently, the t-distribution is used to test a hypothesis about $\mu_{Y.X}$ for a given value of X. The test procedure is carried out in the usual manner.


21.3.4. Testing Hypothesis about Population Variance $\sigma_{Y.X}^2$. If we wish to test the null hypothesis $H_0: \sigma_{Y.X}^2 = \sigma_0^2$, where $\sigma_0^2$ is some specified value of $\sigma_{Y.X}^2$, against a suitable alternative hypothesis, then the statistic
$$\chi^2 = \frac{(n - 2)\, s_{Y.X}^2}{\sigma_0^2}$$
has, if $H_0$ is true, a chi-square distribution with $(n - 2)$ degrees of freedom. We reject $H_0$ if the calculated value of $\chi^2$ falls in the critical region, otherwise we accept it.
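As a small illustration of the statistic $\chi^2 = (n - 2)\, s_{Y.X}^2 / \sigma_0^2$, the sketch below reuses $s_{Y.X}^2 = 6.647$ and $n = 10$ from Example 21.4. The hypothesised value $\sigma_0^2 = 4$ and the one-sided alternative $H_1: \sigma_{Y.X}^2 > 4$ are assumptions made purely for the illustration, the function name `variance_chi_square` is hypothetical, and the critical value is read from chi-square tables.

```python
def variance_chi_square(s2_yx, n, sigma0_sq):
    """Chi-square statistic with n - 2 d.f. for H0: sigma^2_{Y.X} = sigma0_sq."""
    return (n - 2) * s2_yx / sigma0_sq


# s^2_{Y.X} and n from Example 21.4; sigma0_sq = 4 is a hypothetical value.
chi2 = variance_chi_square(s2_yx=6.647, n=10, sigma0_sq=4.0)
chi2_crit = 15.507  # upper 5% point of chi-square with 8 d.f., from tables
reject_h0 = chi2 >= chi2_crit  # right-tailed test for H1: sigma^2 > sigma0_sq

print(f"chi-square = {chi2:.3f}, reject H0: {reject_h0}")
```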

21.3.5. Testing Hypothesis about Equality of Regression Co-efficients of Two Regression Lines. Suppose that there are two linear regression lines
$$\mu_{Y_1.X_1} = \alpha_1 + \beta_1 X_1 \quad \text{and} \quad \mu_{Y_2.X_2} = \alpha_2 + \beta_2 X_2$$
and we wish to test the hypothesis that the two regression lines have equal regression co-efficients, that is, we wish to test $H_0: \beta_1 = \beta_2$.

Let $b_1$ and $b_2$, the least-squares estimates of $\beta_1$ and $\beta_2$ respectively, be obtained from two random samples of sizes $n_1$ and $n_2$, the first sample being taken from one population and the second sample being drawn from another population. Then the statistic $b_1 - b_2$ is normally distributed with
$$\text{mean} = \beta_1 - \beta_2 \quad \text{and} \quad \text{variance} = \frac{\sigma_{Y_1.X_1}^2}{\sum (X_{1i} - \bar{X}_1)^2} + \frac{\sigma_{Y_2.X_2}^2}{\sum (X_{2i} - \bar{X}_2)^2},$$
as the populations are assumed to be normal for each X. If the two populations have equal variances, then the variance becomes
$$\sigma_{Y.X}^2 \left[\frac{1}{\sum (X_{1i} - \bar{X}_1)^2} + \frac{1}{\sum (X_{2i} - \bar{X}_2)^2}\right].$$
As $\sigma_{Y.X}^2$ is generally not known, its pooled estimate, denoted by $s_{Y.X.p}^2$, is obtained by the relation
$$s_{Y.X.p}^2 = \frac{\sum (Y_{1i} - \hat{Y}_1)^2 + \sum (Y_{2i} - \hat{Y}_2)^2}{n_1 + n_2 - 4}.$$
It can be shown that the statistic
$$t = \frac{(b_1 - b_2) - (\beta_1 - \beta_2)}{s_{Y.X.p} \left[\dfrac{1}{\sum (X_{1i} - \bar{X}_1)^2} + \dfrac{1}{\sum (X_{2i} - \bar{X}_2)^2}\right]^{1/2}}$$
conforms to a t-distribution with $\nu = n_1 + n_2 - 4$ degrees of freedom. Hence, for testing the hypothesis that $\beta_1 = \beta_2$ given equal variances, we use the statistic
$$t = \frac{b_1 - b_2}{s_{Y.X.p} \left[\dfrac{1}{\sum (X_{1i} - \bar{X}_1)^2} + \dfrac{1}{\sum (X_{2i} - \bar{X}_2)^2}\right]^{1/2}},$$
which, if $H_0$ is true, has a t-distribution with $n_1 + n_2 - 4$ degrees of freedom. The rest of the procedure for testing the hypothesis is the same.

A $(1 - \alpha)100$ per cent confidence interval for $\beta_1 - \beta_2$ is given by
$$(b_1 - b_2) \pm t_{\alpha/2,(n_1 + n_2 - 4)}\, s_{Y.X.p} \sqrt{\frac{1}{\sum (X_{1i} - \bar{X}_1)^2} + \frac{1}{\sum (X_{2i} - \bar{X}_2)^2}}.$$

Example 21.6. Two random samples as detailed below have been drawn from two populations with equal variances:

[Table: four pairs of (X, Y) values for Sample I and four pairs for Sample II]

(a) Find the estimates of $\beta_1$ and $\beta_2$, the regression coefficients of the two regression lines.

(b) Test the hypothesis $H_0: \beta_1 = \beta_2$ against the alternative $H_1: \beta_1 \ne \beta_2$. Use $\alpha = 0.05$. (M.Sc., P.U. 1988, 90; I.U., 1994)

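(a) To estimate $\beta_1$ and $\beta_2$, we carry out the necessary calculations as below:

Sample I

$\sum X_{1i} = 15$, $\sum Y_{1i} = 47$, $\sum X_{1i}^2 = 59$, $\sum X_{1i} Y_{1i} = 179$, and $\sum Y_{1i}^2 = 557$.
$$b_1 = \frac{n_1 \sum X_{1i} Y_{1i} - (\sum X_{1i})(\sum Y_{1i})}{n_1 \sum X_{1i}^2 - (\sum X_{1i})^2} = \frac{(4)(179) - (15)(47)}{(4)(59) - (15)^2} = 1.$$

Sample II

$\sum X_{2i} = 7$, $\sum Y_{2i} = 26$, $\sum X_{2i}^2 = 15$, $\sum X_{2i} Y_{2i} = 47$, and $\sum Y_{2i}^2 = 174$.
$$b_2 = \frac{n_2 \sum X_{2i} Y_{2i} - (\sum X_{2i})(\sum Y_{2i})}{n_2 \sum X_{2i}^2 - (\sum X_{2i})^2} = \frac{(4)(47) - (7)(26)}{(4)(15) - (7)^2} = 0.545.$$

Thus the estimates of $\beta_1$ and $\beta_2$, the two regression coefficients of the two regression lines, are $b_1 = 1$ and $b_2 = 0.545$.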


(b)
(i) We state our null and alternative hypotheses as
    $H_0: \beta_1 = \beta_2$, and $H_1: \beta_1 \ne \beta_2$.
(ii) The significance level is set at $\alpha = 0.05$.
(iii) The test-statistic under $H_0$ is
$$t = \frac{b_1 - b_2}{s_{Y.X.p} \sqrt{\dfrac{1}{\sum (X_{1i} - \bar{X}_1)^2} + \dfrac{1}{\sum (X_{2i} - \bar{X}_2)^2}}},$$
which has a t-distribution with $n_1 + n_2 - 4$ $(= 4)$ degrees of freedom.
(iv) Computations. The necessary computations are given below:

Sample I: $a_1 = \bar{Y}_1 - b_1 \bar{X}_1 = 11.75 - (1)(3.75) = 8$;
$$\sum (X_{1i} - \bar{X}_1)^2 = \sum X_{1i}^2 - (\sum X_{1i})^2 / n_1 = 59 - (15)^2/4 = 2.75;$$
$$\sum (Y_{1i} - \hat{Y}_1)^2 = \sum Y_{1i}^2 - a_1 \sum Y_{1i} - b_1 \sum X_{1i} Y_{1i} = 557 - (8)(47) - (1)(179) = 2.00.$$

Sample II: $a_2 = \bar{Y}_2 - b_2 \bar{X}_2 = 6.50 - (0.545)(1.75) = 5.546$;
$$\sum (X_{2i} - \bar{X}_2)^2 = \sum X_{2i}^2 - (\sum X_{2i})^2 / n_2 = 15 - (7)^2/4 = 2.75;$$
$$\sum (Y_{2i} - \hat{Y}_2)^2 = \sum Y_{2i}^2 - a_2 \sum Y_{2i} - b_2 \sum X_{2i} Y_{2i} = 174 - (5.546)(26) - (0.545)(47) = 4.189.$$

Now
$$s_{Y.X.p}^2 = \frac{\sum (Y_{1i} - \hat{Y}_1)^2 + \sum (Y_{2i} - \hat{Y}_2)^2}{n_1 + n_2 - 4} = \frac{2.00 + 4.189}{4 + 4 - 4} = 1.5473,$$
so that $s_{Y.X.p} = \sqrt{1.5473} = 1.2439$, and
$$t = \frac{1 - 0.545}{1.2439 \sqrt{\dfrac{1}{2.75} + \dfrac{1}{2.75}}} = \frac{0.455}{(1.2439)(0.8528)} = \frac{0.455}{1.0608} = 0.429.$$
(v) The critical region is $|t| \ge t_{0.025,(4)} = 2.776$.
(vi) Conclusion. Since the calculated value of $t = 0.429$ does not fall in the critical region, we cannot reject $H_0$ and may conclude that the data are consistent with equal regression coefficients. Accepting $H_0: \beta_1 = \beta_2$ means that the two regression lines may be regarded as parallel.
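The computations of Example 21.6 can be sketched in Python from the two sets of sums. The helper names `fit_from_sums` and `equal_slopes_t` are hypothetical, and the critical value 2.776 is read from t-tables; the sketch assumes, as the example does, that the two populations have equal variances.

```python
import math


def fit_from_sums(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2):
    """Return (b, sxx, sse) for one sample from its raw sums."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    sxx = sum_x2 - sum_x ** 2 / n             # sum (X - Xbar)^2
    sse = sum_y2 - a * sum_y - b * sum_xy     # sum (Y - Yhat)^2
    return b, sxx, sse


def equal_slopes_t(sample1, sample2):
    """t statistic and d.f. for H0: beta1 = beta2 under equal variances."""
    b1, sxx1, sse1 = fit_from_sums(**sample1)
    b2, sxx2, sse2 = fit_from_sums(**sample2)
    df = sample1["n"] + sample2["n"] - 4
    s_pooled = math.sqrt((sse1 + sse2) / df)  # pooled estimate s_{Y.X.p}
    t = (b1 - b2) / (s_pooled * math.sqrt(1 / sxx1 + 1 / sxx2))
    return t, df


# Sums from Example 21.6.
sample_1 = dict(n=4, sum_x=15, sum_y=47, sum_x2=59, sum_xy=179, sum_y2=557)
sample_2 = dict(n=4, sum_x=7, sum_y=26, sum_x2=15, sum_xy=47, sum_y2=174)

t_stat, df = equal_slopes_t(sample_1, sample_2)
t_crit = 2.776  # t(0.025, 4), read from t-tables
reject_h0 = abs(t_stat) >= t_crit

print(f"t = {t_stat:.3f} with {df} d.f., reject H0: {reject_h0}")
```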


21.3.6. Testing Hypothesis about the Linearity of Regression. Frequently, we are interested in testing the null hypothesis that the regression model is linear, i.e. we wish to test $H_0: \mu_{Y.X} = \alpha + \beta X$. For this purpose, we select a random sample of $n$ observations with $k$ distinct values of X, and for each (or at least one) distinct value of X, the Y observations are repeated $n_1, n_2, \ldots, n_k$ ($\sum n_i = n$) times respectively, as indicated in the following table:

x         y-values                                Total
x_1       y_11, y_12, ..., y_1n_1                 y_1.
x_2       y_21, y_22, ..., y_2n_2                 y_2.
...       ...                                     ...
x_k       y_k1, y_k2, ..., y_kn_k                 y_k.

where $y_{i.}$ represents the sum of the y-values corresponding to $x_i$. It has been shown that, when $H_0$ is true, the statistic
$$F = \frac{\chi_1^2 / (k - 2)}{\chi_2^2 / (n - k)},$$
where
$$\chi_1^2 = \sum_i \frac{y_{i.}^2}{n_i} - \frac{(\sum_i y_{i.})^2}{n} - b^2 \sum (x - \bar{x})^2, \quad \text{and} \quad \chi_2^2 = \sum_i \sum_j y_{ij}^2 - \sum_i \frac{y_{i.}^2}{n_i},$$
has an F-distribution with $\nu_1 = k - 2$ and $\nu_2 = n - k$ degrees of freedom. To determine $a$, $b$ and $\sum (x - \bar{x})^2$, we fit the assumed regression line, taking the $n$ data points.

We reject the null hypothesis of linearity of regression when the computed value of F falls in the critical region located in the right tail of the F-distribution with $\nu_1 = k - 2$ and $\nu_2 = n - k$ degrees of freedom at the significance level of $\alpha$. The rest of the procedure is the same.


Example 21.7. The following data show the heights (X) ani 


weights (Y). of twelve men. We selected the heights in advance and then _ 
observed the weights of ‘a random group of men having the selected | 


heights. 
X: 60, G60, GO, 62, 62, 62, 62 G4, G4, 70, 70, %. 
n Yi MO, 135, 120, 120, 140, 130, 136 150, 145, 170,185, 100 


Test the hypothesis at the 0.05 level or significance, that th: regressionis | 


linear. 
Using all the twelve values, we find that the estimated regression line is 
| Y = -179.42 + 5.08X, and 
ie 2 2 
E-z) = x2 _ Oxy? =.49068 — (766)"" = 171.67. 


2 12 
To test the hypothesis, we proceed as below: 


(i) We state our null and alternative hypotheses as
$H_0$: the regression is linear, i.e. $\mu_{Y|X} = \alpha + \beta X$, and
$H_1$: the regression is non-linear.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$F = \frac{\chi_1^2/(k-2)}{\chi_2^2/(n-k)},$$

where $\chi_1^2 = \sum_i \frac{y_{i.}^2}{n_i} - \frac{(\sum y_{ij})^2}{n} - b^2 \sum (x - \bar{x})^2$, and $\chi_2^2 = \sum\sum y_{ij}^2 - \sum_i \frac{y_{i.}^2}{n_i}$,

which, if $H_0$ is true, has an F-distribution with $v_1 = k-2$ and $v_2 = n-k$ degrees of freedom.


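(iv) Computations. To find the necessary calculations, we re-arrange the data as follows:

x_i     y_ij                     n_i     y_i.
60      110, 135, 120            3       365
62      120, 140, 130, 135       4       525
64      150, 145                 2       295
70      170, 185, 160            3       515
Total                            12      1700

Now
$$\chi_1^2 = \sum_i \frac{y_{i.}^2}{n_i} - \frac{(\sum y_{ij})^2}{n} - b^2 \sum (x - \bar{x})^2$$
$$= \frac{(365)^2}{3} + \frac{(525)^2}{4} + \frac{(295)^2}{2} + \frac{(515)^2}{3} - \frac{(1700)^2}{12} - (5.03)^2 (171.67)$$
$$= 245,235.41 - 240,833.33 - 4,343.41 = 58.67, \text{ and}$$

$$\chi_2^2 = \sum\sum y_{ij}^2 - \sum_i \frac{y_{i.}^2}{n_i}$$
$$= (110)^2 + (135)^2 + \ldots + (160)^2 - \left[\frac{(365)^2}{3} + \frac{(525)^2}{4} + \frac{(295)^2}{2} + \frac{(515)^2}{3}\right]$$
$$= 246,100 - 245,235.41 = 864.59.$$

Hence
$$F = \frac{\chi_1^2/(k-2)}{\chi_2^2/(n-k)} = \frac{58.67/(4-2)}{864.59/(12-4)} = \frac{29.335}{108.074} = 0.27,$$

and $v_1 = 2$ and $v_2 = 8$.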


(v) The critical region is $F \ge F_{0.05}(2, 8) = 4.46$.

(vi) Conclusion. Since the computed value of F = 0.27 does not fall in the critical region, we cannot reject $H_0$ and conclude that there is not sufficient evidence to indicate that the regression departs from a straight line.
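The computations of Example 21.7 can be checked with a short script. The following is a minimal sketch in Python using only the standard library; the function name `linearity_f_test` is an illustrative choice and not part of the text, and the critical value $F_{0.05}(2, 8) = 4.46$ is still read from the F-table. Because the slope is not rounded to 5.03 here, the result may differ from the hand computation in the second decimal place.

```python
from collections import defaultdict


def linearity_f_test(x, y):
    """Return (F, v1, v2) for the test of linearity of regression
    when Y is observed repeatedly at some of the X values."""
    n = len(x)
    groups = defaultdict(list)          # y-values collected per distinct x
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    k = len(groups)                     # number of distinct x-values

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                       # least-squares slope

    between = sum(sum(g) ** 2 / len(g) for g in groups.values())
    chi1 = between - sum(y) ** 2 / n - b ** 2 * sxx      # departure from linearity
    chi2 = sum(yi ** 2 for yi in y) - between            # variation within groups
    f_value = (chi1 / (k - 2)) / (chi2 / (n - k))
    return f_value, k - 2, n - k


heights = [60, 60, 60, 62, 62, 62, 62, 64, 64, 70, 70, 70]
weights = [110, 135, 120, 120, 140, 130, 135, 150, 145, 170, 185, 160]
F, v1, v2 = linearity_f_test(heights, weights)
print(f"F = {F:.2f} with ({v1}, {v2}) degrees of freedom")
```

The computed F is then compared with the tabulated value for (2, 8) degrees of freedom, exactly as in steps (v) and (vi).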


21.4 CONFIDENCE INTERVAL ESTIMATE FOR POPULATION CORRELATION CO-EFFICIENT.
Let r be the sample correlation co-efficient obtained from a random sample of n pairs of values from a bivariate normal population having a linear correlation $\rho$. As a sample mean $\bar{x}$ is used to estimate the population mean $\mu$, in the same way r is used to estimate the value of $\rho$. The sampling distribution of r was shown by Sir R.A. Fisher to depend only on $\rho$ and n. The standard deviation of the distribution of r is approximately equal to $\frac{1-\rho^2}{\sqrt{n}}$. It is also shown that the distribution of r is far from normal for large values of $\rho$, being sharply skewed in the neighbourhood of $\rho = \pm 1$.

If the sample is large enough (n > 400) and if $\rho$ is only moderately large, then r is approximately normally distributed with mean $\rho$ and standard deviation $\frac{1-\rho^2}{\sqrt{n}}$.

[Figure: Distribution of r for $\rho = 0$ and $\rho = 0.8$ when n = 9]


Thus the standard error of r is $\frac{1-\rho^2}{\sqrt{n}}$, but it is customary to take it as $\frac{1-r^2}{\sqrt{n}}$ in spite of the fact that r is a biased estimate of $\rho$. Thus the statistic

$$Z = \frac{r - \rho}{(1-r^2)/\sqrt{n}}$$

may be regarded as a standard normal variable.

But this is not recommended for use because it is not valid if n is small and $\rho$ is large. It should be remembered that the distribution of r for small samples is skew and the skewness increases with $\rho$.

We can change the non-normal distribution of r by a simple transformation into an approximately normal distribution. This change of variable, known as "Fisher's z-transformation", is from r to $z_r$, where

$$z_r = \frac{1}{2} \log_e \frac{1+r}{1-r} = 1.1513 \log_{10} \frac{1+r}{1-r}.$$


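Fisher (1890-1962) showed that the random variable $Z_r$ is approximately normally distributed with a mean of $\mu_z = \frac{1}{2} \log_e \frac{1+\rho}{1-\rho} = 1.1513 \log_{10} \frac{1+\rho}{1-\rho}$ and a standard deviation approximately of $\frac{1}{\sqrt{n-3}}$. Hence the statistic

$$Z = \frac{Z_r - \mu_z}{1/\sqrt{n-3}} = \frac{\left[1.1513 \log \frac{1+r}{1-r}\right] - \left[1.1513 \log \frac{1+\rho}{1-\rho}\right]}{1/\sqrt{n-3}}$$

is approximately N(0, 1), and we can assert that

$$P\left[-z_{\alpha/2} < \frac{Z_r - \mu_z}{1/\sqrt{n-3}} < z_{\alpha/2}\right] \approx 1 - \alpha.$$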


This statement after simplification becomes

$$P\left[Z_r - \frac{z_{\alpha/2}}{\sqrt{n-3}} < \mu_z < Z_r + \frac{z_{\alpha/2}}{\sqrt{n-3}}\right] \approx 1 - \alpha.$$

As corresponding to a particular value of r obtained from a particular sample, we have a particular value of $z_r$, therefore an approximate $100(1-\alpha)$ per cent confidence interval for $\mu_z$ is given by

$$z_r \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}.$$

Thus we see that, in order to compute a confidence interval for the population correlation coefficient $\rho$, we have to first construct a confidence interval for $\mu_z$ and then have to transform it into an interval for $\rho$.

Using Fisher's z-table, we find values, denoted by $\rho_L$ and $\rho_U$, that correspond to Fisher's z-values equal to $z_r - \frac{z_{\alpha/2}}{\sqrt{n-3}}$ and $z_r + \frac{z_{\alpha/2}}{\sqrt{n-3}}$.

Hence an approximate $100(1-\alpha)$ per cent confidence interval for $\rho$ is $(\rho_L, \rho_U)$.


Example 21.8 A random sample of size 28 pairs from a bivariate normal population showed a correlation co-efficient of 0.7. Find a 95% confidence interval for the population correlation coefficient $\rho$.
(I.U., M.Sc, 19)

The degree of confidence is 0.95, therefore $z_{0.025} = 1.96$.

The 95% confidence interval for $\mu_z = 1.1513 \log \frac{1+\rho}{1-\rho}$ is

$$z_r - \frac{1.96}{\sqrt{n-3}} < \mu_z < z_r + \frac{1.96}{\sqrt{n-3}}.$$

Now $z_r = 1.1513 \log \frac{1+r}{1-r} = 1.1513 \log \frac{1+0.7}{1-0.7} = 1.1513 \log \frac{1.7}{0.3} = 0.87$, and $n - 3 = 25$.

Substituting, we get

$$0.87 - \frac{1.96}{\sqrt{25}} < \mu_z < 0.87 + \frac{1.96}{\sqrt{25}}$$

or $0.87 - 0.392 < \mu_z < 0.87 + 0.392$
or $0.48 < \mu_z < 1.26$.

Using Fisher's z-table, we find values, $\rho_L$ and $\rho_U$, that correspond to Fisher's z-values equal to 0.48 and 1.26 respectively.
Thus $\rho_L = 0.446$ and $\rho_U = 0.851$.

Hence the approximate 95% confidence interval for $\rho$ is (0.45, 0.85).
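The interval of Example 21.8 can also be obtained without Fisher's z-table, since $z_r = \frac{1}{2}\log_e\frac{1+r}{1-r}$ is the inverse hyperbolic tangent of r. The sketch below is only an illustration under that observation; the function name `fisher_ci` is hypothetical, and small differences from the worked answer arise because the text rounds $z_r$ to 0.87.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for rho via Fisher's z-transformation."""
    z_r = math.atanh(r)                       # equals 1.1513 * log10((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95%
    half_width = z_crit / math.sqrt(n - 3)    # z_crit times the standard error of z_r
    # Transform the limits for mu_z back to the correlation scale.
    return math.tanh(z_r - half_width), math.tanh(z_r + half_width)


lower, upper = fisher_ci(0.7, 28)
print(f"95% confidence interval for rho: ({lower:.3f}, {upper:.3f})")
```

Here `math.tanh` plays the role of reading the z-table backwards, from a z-value to the corresponding correlation.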


21.5 HYPOTHESIS TESTING ABOUT CORRELATION COEFFICIENT
As stated earlier, the sampling distribution of r, the sample estimate of the population correlation co-efficient $\rho$, is neither a normal distribution nor a distribution that becomes approximately normal quickly as the sample size increases. However, we change r into another random variable, denoted by $Z_r$ and defined as $Z_r = 1.1513 \log \frac{1+r}{1-r}$, which is approximately normal with a mean of $\mu_z = 1.1513 \log \frac{1+\rho}{1-\rho}$ and a standard deviation approximately of $\frac{1}{\sqrt{n-3}}$. Thus the statistic

$$Z = \frac{Z_r - \mu_z}{1/\sqrt{n-3}}$$

is approximately standard normal regardless of the value of $\rho$ and hence provides a method to test the hypotheses about $\rho$. It is important to note that the standard error of $Z_r$ is independent of $\rho$ and that it is $Z_r$, rather than r, that is used for testing hypotheses about $\rho$.

The random variable $Z_r$ is used to test the hypotheses that
(a) the population correlation co-efficient $\rho$ is equal to a specified value $\rho_0$, where $\rho_0$ is not equal to zero, i.e. $H_0: \rho = \rho_0$ ($\ne 0$),
(b) the correlation coefficients of two populations are equal, i.e. $H_0: \rho_1 = \rho_2$.

To test the hypothesis $H_0: \rho = 0$, the test-statistic to be used is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$

which conforms to Student's t-distribution.
21.5.1. Testing the Hypothesis that $\rho$, the Population Correlation Coefficient equals Some Specified Value other than Zero. The procedure for testing the null hypothesis $H_0: \rho = \rho_0$, where $\rho_0 \ne 0$, is outlined below:

(i) Set up the null hypothesis $H_0: \rho = \rho_0$ and formulate an appropriate alternative hypothesis.

(ii) Decide on the significance level of size $\alpha$.

(iii) The test-statistic to be used is

$$Z = \frac{Z_r - \mu_z}{1/\sqrt{n-3}},$$

where $Z_r = 1.1513 \log \frac{1+r}{1-r}$ and $\mu_z = 1.1513 \log \frac{1+\rho_0}{1-\rho_0}$.
The variable Z is approximately standard normal.

(iv) The critical region is
$Z \le -z_{\alpha/2}$ or $Z \ge z_{\alpha/2}$ for $H_1: \rho \ne \rho_0$;
$Z \le -z_{\alpha}$ for $H_1: \rho < \rho_0$;
$Z \ge z_{\alpha}$ for $H_1: \rho > \rho_0$.

(v) Compute the value of Z from the sample data.

(vi) Decide as under:
Reject $H_0$, if z falls in the critical region.
Accept $H_0$, otherwise.


Example 21.9. A random sample of size 28 pairs from a bivariate normal population showed a correlation coefficient of 0.7. Is this value consistent with the assumption that the correlation coefficient in the population is 0.5?

(i) We set up our null and alternative hypotheses as
$H_0: \rho = 0.5$ and $H_1: \rho \ne 0.5$.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) The test-statistic to be used in this case is

$$Z = \frac{Z_r - \mu_z}{1/\sqrt{n-3}},$$

where $Z_r = 1.1513 \log \frac{1+r}{1-r}$ and $\mu_z = 1.1513 \log \frac{1+\rho_0}{1-\rho_0}$. The variable Z is approximately standard normal.

(iv) The critical region for $\alpha = 0.05$ is $Z \le -1.96$ or $Z \ge 1.96$.

(v) Computations: We are given r = 0.7, $\rho_0$ = 0.5 and n = 28.

Now $Z_r = 1.1513 \log \frac{1+0.7}{1-0.7} = 1.1513 \log \frac{1.7}{0.3} = 0.87$, and

$\mu_z = 1.1513 \log \frac{1+0.5}{1-0.5} = 1.1513 \log \frac{1.5}{0.5} = 0.55$, so that

$$z = \frac{0.87 - 0.55}{1/\sqrt{28-3}} = (0.32)(5) = 1.60.$$

(vi) Conclusion. Since the computed value z = 1.60 does not fall in the critical region, we accept our null hypothesis and conclude that the sample value is consistent with a population correlation co-efficient of 0.5.
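The procedure of Section 21.5.1, as applied in Example 21.9, can be sketched in a few lines. This is an illustrative Python sketch using only the standard library; the name `z_test_rho` is an assumption made here, and the two-sided p-value is an addition that the text does not compute.

```python
import math
from statistics import NormalDist


def z_test_rho(r, n, rho0):
    """Test H0: rho = rho0 (rho0 != 0) using Fisher's z-transformation.

    Returns the standard normal test statistic and its two-sided p-value.
    """
    z_r = math.atanh(r)          # Fisher z of the sample correlation
    mu_z = math.atanh(rho0)      # Fisher z of the hypothesised correlation
    z = (z_r - mu_z) * math.sqrt(n - 3)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


z, p = z_test_rho(r=0.7, n=28, rho0=0.5)
print(f"z = {z:.2f}, two-sided p-value = {p:.3f}")
print("reject H0" if abs(z) >= 1.96 else "do not reject H0")
```

Without intermediate rounding the statistic is about 1.59 rather than 1.60, which leads to the same decision.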


21.5.2. Testing the Hypothesis about the Equality of Two Correlation Coefficients. Let $r_1$ and $r_2$ be the correlation co-efficients of two random samples of sizes $n_1$ and $n_2$ pairs, drawn from two bivariate normal populations with correlation coefficients $\rho_1$ and $\rho_2$. To test the hypothesis $H_0: \rho_1 = \rho_2$, we calculate

$$Z_{r_1} = 1.1513 \log \frac{1+r_1}{1-r_1} \quad \text{and} \quad Z_{r_2} = 1.1513 \log \frac{1+r_2}{1-r_2}.$$

Since $Z_{r_1}$ and $Z_{r_2}$ are approximately normally distributed, therefore the difference $Z_{r_1} - Z_{r_2}$, if $H_0: \rho_1 = \rho_2$ is true, is approximately normally distributed with a mean zero and standard deviation $\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}$.
The test-statistic would then be

$$Z = \frac{Z_{r_1} - Z_{r_2}}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}},$$

which is approximately standard normal.

The rest of the procedure for testing $H_0: \rho_1 = \rho_2$ is the same.


Example 21.10. A random sample of 28 pairs from a certain bivariate normal population gave r = 0.6; another random sample of 23 pairs from another bivariate population gave r = 0.4. Test at the 0.05 significance level the hypothesis $H_0: \rho_1 = \rho_2$ against the alternative $H_1: \rho_1 \ne \rho_2$.

(i) Our null and alternative hypotheses are
$H_0: \rho_1 = \rho_2$ and $H_1: \rho_1 \ne \rho_2$.

(ii) The significance level is at $\alpha = 0.05$.

(iii) The test-statistic to be used under $H_0$ is

$$Z = \frac{Z_{r_1} - Z_{r_2}}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}},$$

where $Z_{r_1} = 1.1513 \log \frac{1+r_1}{1-r_1}$ and $Z_{r_2} = 1.1513 \log \frac{1+r_2}{1-r_2}$.
The variable Z is approximately standard normal.

(iv) The critical region is $|Z| \ge 1.96$.

(v) Computations:

$Z_{r_1} = 1.1513 \log \frac{1+0.6}{1-0.6} = 1.1513 \log \frac{1.6}{0.4} = 0.69$, and

$Z_{r_2} = 1.1513 \log \frac{1+0.4}{1-0.4} = 1.1513 \log \frac{1.4}{0.6} = 0.42$, so that

$$z = \frac{0.69 - 0.42}{\sqrt{\frac{1}{25} + \frac{1}{20}}} = \frac{0.27}{0.30} = 0.90.$$

(vi) Conclusion. Since the computed value z = 0.90 does not fall in the critical region, we accept our hypothesis and conclude that the data do not present sufficient evidence to indicate that the correlation co-efficients of the two populations differ.
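The two-sample comparison of Example 21.10 follows the same pattern and can be sketched as below. The function name `z_test_two_rhos` is illustrative only, and the code assumes independent samples from bivariate normal populations, as the text does.

```python
import math


def z_test_two_rhos(r1, n1, r2, n2):
    """Approximate standard normal statistic for H0: rho1 = rho2."""
    # Standard error of the difference of two independent Fisher z-values.
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / se


z_two = z_test_two_rhos(r1=0.6, n1=28, r2=0.4, n2=23)
print(f"z = {z_two:.2f}")
print("reject H0" if abs(z_two) >= 1.96 else "do not reject H0")
```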




21.5.3. Testing the Hypothesis that $\rho = 0$. We are often interested in testing the null hypothesis that the population correlation co-efficient equals zero. That is, we wish to test $H_0: \rho = 0$ (i.e. there is no linear correlation between the variables X and Y). We have seen that the sampling distribution of r, the sample correlation coefficient, is skewed when $\rho$ is not zero. However, when $\rho = 0$, the sampling distribution of r is symmetric. This property makes it possible to test the hypothesis $H_0: \rho = 0$ by using the t-distribution. Thus when the random variables X and Y are normally distributed and $\rho = 0$, the statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

has a Student's t-distribution with v = n - 2 degrees of freedom. We would reject $H_0$, if

$t < -t_{\alpha,(n-2)}$ when $H_1$ is $\rho < 0$,
$t > t_{\alpha,(n-2)}$ when $H_1$ is $\rho > 0$,
$|t| \ge t_{\alpha/2,(n-2)}$ when $H_1$ is $\rho \ne 0$.

We accept $H_0$ otherwise and conclude that X and Y are not linearly correlated. It is interesting to note that testing the hypothesis $H_0: \rho = 0$ against $H_1: \rho \ne 0$ is equivalent to testing the hypothesis $H_0: \beta = 0$ against $H_1: \beta \ne 0$. Thus a test that will reject the hypothesis $H_0: \beta = 0$ will also reject the hypothesis $H_0: \rho = 0$. A t-test used to reject or to accept $H_0: \rho = 0$ is satisfactory for any sample size.

An alternative way of testing the hypothesis $H_0: \rho = 0$ against $H_1: \rho \ne 0$ is the use of the F-statistic. The statistic

$$F = \frac{r^2(n-2)}{1-r^2}$$

has an F-distribution with $v_1 = 1$ and $v_2 = n - 2$ degrees of freedom. The two tests are equivalent because the t-statistic is related to the F-statistic (with 1 d.f. in the numerator) by $t^2 = F$.

An F-test can also be used to test the hypothesis
$H_0: \rho^2 = 0$, the simple coefficient of determination is zero, against
$H_1: \rho^2 \ne 0$, the coefficient of determination is not zero.
If $F > F_{\alpha}(v_1, v_2)$, we reject $H_0$, otherwise we accept $H_0$. The rest of the procedure is the same.

To test the hypothesis about rank correlation, $r_s$ (when the sample size is neither too small nor too large), i.e. to test
$H_0$: The ranks of two population data sets are not correlated, against
$H_1$: The ranks are correlated,
the appropriate test-statistic is

$$t = \frac{r_s\sqrt{n-2}}{\sqrt{1-r_s^2}},$$

which, if $H_0$ is true, has a Student's t-distribution with v = n - 2 d.f. We reject $H_0$ if t falls in the critical region.
Example 21.11. A random sample of 20 pairs of observations gave a coefficient of correlation of 0.45. Test the hypothesis at the 0.05 level of significance that the correlation co-efficient in the population is zero.

(i) We state the null and alternative hypotheses as
$H_0: \rho = 0$ and $H_1: \rho \ne 0$.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$

which, if $H_0$ is true, has a t-distribution with (n - 2) = 18 degrees of freedom.

(iv) The critical region is $|t| \ge t_{0.025,(18)} = 2.10$.

(v) Computations. Substituting the values, we get

$$t = \frac{0.45\sqrt{20-2}}{\sqrt{1-(0.45)^2}} = \frac{1.909}{0.893} = 2.14.$$

(vi) Conclusion. Since the computed value of t = 2.14 falls in the critical region, we therefore reject our null hypothesis and conclude that the correlation co-efficient in the population differs from zero.
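The t-statistic of Example 21.11, and its relation $t^2 = F$ stated in Section 21.5.3, can be checked numerically. The sketch below is a minimal illustration; the function name is hypothetical and the critical value 2.10 is taken from the t-table, since the Python standard library has no t-distribution.

```python
import math


def t_for_zero_correlation(r, n):
    """t-statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r, n = 0.45, 20
t = t_for_zero_correlation(r, n)
f_ratio = r * r * (n - 2) / (1 - r * r)   # equivalent F-statistic with (1, n - 2) d.f.

t_critical = 2.10                          # tabulated t_{0.025,(18)}
print(f"t = {t:.2f}, F = {f_ratio:.2f}, t squared = {t * t:.2f}")
print("reject H0" if abs(t) >= t_critical else "do not reject H0")
```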


Example 21.12. (a) Find the least value of r in a sample of 18 pairs from a bivariate normal population, that is significant at the 5% level.

(b) How many pairs of observations must be included in a sample in order that an observed correlation coefficient of value 0.47 shall have a calculated value of t greater than 2.12?
(P.U., B.A/B.Sc. 1991)

(a) The significance of r is tested by the statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}.$$

Substituting the value of n, we get

$$t = \frac{r\sqrt{18-2}}{\sqrt{1-r^2}} = \frac{4r}{\sqrt{1-r^2}}.$$

For a significant value of r, t should be $\ge t_{0.025,(16)} = 2.12$.

That is $\frac{4r}{\sqrt{1-r^2}} \ge 2.12$

or $16r^2 \ge (2.12)^2 (1 - r^2)$

or $20.4944\, r^2 \ge 4.4944$ or $r^2 \ge \frac{4.4944}{20.4944} = 0.22$

or $|r| \ge 0.47$.

Hence the required least value of r = 0.47.

(b) Given $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} > 2.12$.

Substituting values, we get

$$\frac{0.47\sqrt{n-2}}{\sqrt{1-(0.47)^2}} > 2.12$$

or $(0.47)^2 (n - 2) > (2.12)^2 (1 - 0.2209)$

or $(n - 2) > \frac{(4.4944)(0.7791)}{0.2209}$

or $(n - 2) > 15.85$ or $n > 17.85$.

Hence the number of pairs must be 18.
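Both parts of Example 21.12 amount to solving $t = r\sqrt{n-2}/\sqrt{1-r^2}$ for one unknown. The following sketch shows the two rearrangements; the function names are illustrative, and the critical value 2.12 is passed in as given in the example rather than looked up.

```python
import math


def least_significant_r(n, t_crit):
    """Smallest |r| whose t-statistic reaches t_crit for a sample of n pairs.

    From t = r*sqrt(n-2)/sqrt(1-r^2) it follows that r^2 = t^2 / (t^2 + n - 2).
    """
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


def least_sample_size(r, t_crit):
    """Smallest n for which the t-statistic of an observed r exceeds t_crit.

    The inequality is n - 2 > t_crit^2 * (1 - r^2) / r^2.
    """
    bound = t_crit ** 2 * (1 - r * r) / (r * r) + 2
    return math.floor(bound) + 1


r_min = least_significant_r(n=18, t_crit=2.12)
n_min = least_sample_size(r=0.47, t_crit=2.12)
print(f"least significant r = {r_min:.2f}, least number of pairs = {n_min}")
```

Part (b) treats 2.12 as a fixed threshold, as the question does; in practice the tabulated critical value itself changes with n.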


21.5.4. Testing Hypothesis about the Equality of Several Correlation Coefficients. Suppose $r_1, r_2, \ldots, r_k$ are the simple correlation coefficients computed from samples of $n_1, n_2, \ldots, n_k$ pairs of observations respectively, drawn from bivariate normal populations with the correlation coefficients $\rho_1, \rho_2, \ldots, \rho_k$. We desire to test the hypothesis that the population correlation coefficients are equal, i.e.
$H_0: \rho_1 = \rho_2 = \ldots = \rho_k$ ($= \rho$, say) against the alternative hypothesis
$H_1$: Not all population correlation co-efficients are equal.

To test this hypothesis, the values of r are converted to $z_r$-values (Fisher's z-transformation). That is, we define $Z_{r_1}, Z_{r_2}, \ldots, Z_{r_k}$ by the relation

$$Z_{r_i} = 1.1513 \log \frac{1+r_i}{1-r_i}.$$

We have already seen that for large samples, the $Z_{r_i}$ are all distributed normally with $E(Z_{r_i}) = 1.1513 \log \frac{1+\rho}{1-\rho}$ and $Var(Z_{r_i}) = \frac{1}{n_i - 3}$.
Furthermore, the best linear unbiased estimate of $1.1513 \log \frac{1+\rho}{1-\rho}$ is given by

$$\bar{Z} = \frac{\sum (n_i - 3) Z_{r_i}}{\sum (n_i - 3)}.$$

It has been shown that the statistic

$$u = \sum_{i=1}^{k} (n_i - 3)(Z_{r_i} - \bar{Z})^2$$

is distributed approximately as a $\chi^2$-distribution with (k - 1) degrees of freedom when the null hypothesis is true. The rest of the test-procedure is the same.
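The statistic u of Section 21.5.4 is a weighted sum of squared deviations of the Fisher z-values and is easy to compute directly. The sketch below is an illustration only: the function name `homogeneity_statistic` is hypothetical, and since the text gives no numerical example for this test, the two samples of Example 21.10 are reused as a check. With k = 2 the statistic should equal the square of the two-sample z (about $0.90^2$).

```python
import math


def homogeneity_statistic(rs, ns):
    """Return (u, degrees of freedom, pooled estimate of rho) for H0: all rho_i equal."""
    zs = [math.atanh(r) for r in rs]            # Fisher z of each sample correlation
    weights = [n - 3 for n in ns]               # reciprocals of Var(Z_i)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    u = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    # tanh(z_bar) converts the weighted mean z back to the correlation scale.
    return u, len(rs) - 1, math.tanh(z_bar)


u, df, rho_pooled = homogeneity_statistic(rs=[0.6, 0.4], ns=[28, 23])
print(f"u = {u:.3f} on {df} d.f., pooled r = {rho_pooled:.3f}")
```

The value of u is then compared with the tabulated $\chi^2$ value for k - 1 degrees of freedom.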


21.6 INFERENCE IN PARTIAL, MULTIPLE CORRELATION AND REGRESSION
Hypotheses about the partial correlation, multiple correlation and regression may be tested in the same way as those of simple correlation and regression.

21.6.1. Testing the Hypothesis that a Partial Correlation Coefficient of Order k in the Population is Zero. Assuming that Y and the $X_i$'s are jointly normally distributed and the null hypothesis that the population partial correlations of order k (i.e. with k secondary subscripts), estimated from a random sample of size n, are individually equal to zero, R.A. Fisher has shown that the statistic

$$t = \frac{r_{12.34 \ldots p}\sqrt{n-k-2}}{\sqrt{1 - r_{12.34 \ldots p}^2}}$$

follows a Student's t-distribution with v = n - k - 2 degrees of freedom if the null hypothesis is true. We reject $H_0$, if the calculated value of t falls in the critical region, otherwise we accept it. The procedure is illustrated by Example 21.13.

An alternative way of testing the hypothesis that a particular partial correlation of order k (k denotes the number of secondary subscripts) in the population is zero, that is

$H_0: \rho_{12.34 \ldots p} = 0$ (i.e. a particular partial correlation of order k is zero), and

$H_1: \rho_{12.34 \ldots p} \ne 0$,

is based on the F-test. If $r_{12.34 \ldots p}$ is the corresponding partial correlation calculated from a sample of size n, then the statistic

$$F = \frac{r_{12.34 \ldots p}^2 (n-k-2)}{1 - r_{12.34 \ldots p}^2}$$

has, if $H_0$ is true, an F-distribution with $v_1 = 1$ and $v_2 = n-k-2$. We reject $H_0$, if $F > F_{\alpha}(1, n-k-2)$. Otherwise, we accept $H_0$.

Hypotheses about the equality of several partial correlation coefficients of order k can be tested in the same way as the equality of several simple correlation coefficients; the only difference would be to reduce the effective size of sample $n_i$ by the number of variables held constant, i.e. k.


Example 21.13. From a random sample of 21 sets of values from a normal population the calculated value of a partial correlation of order three is 0.40. Is this consistent with the hypothesis that the corresponding partial correlation in the population is zero? Let $\alpha = 0.05$.

(i) We state our null and alternative hypotheses as
$H_0: \rho_{12.345} = 0$, i.e., a partial correlation of order three in the population is zero, and
$H_1: \rho_{12.345} \ne 0$.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$t = \frac{r_{12.345}\sqrt{n-k-2}}{\sqrt{1 - r_{12.345}^2}},$$

which, if $H_0$ is true, has a t-distribution with v = n - k - 2 degrees of freedom. Here k = 3, so d.f. = 21 - 3 - 2 = 16.

(iv) The critical region is $|t| \ge t_{0.025,(16)} = 2.12$.

(v) Computations. Substituting the values, we get

$$t = \frac{(0.40)\sqrt{21-3-2}}{\sqrt{1-(0.40)^2}} = \frac{1.60}{0.92} = 1.74.$$

(vi) Conclusion. Since the calculated value of t = 1.74 does not fall in the critical region, we accept $H_0$ and conclude that the given partial correlation coefficient is consistent with the hypothesis that the corresponding partial correlation in the population is zero.
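The calculation in Example 21.13 differs from the simple-correlation t-test only in its degrees of freedom, n - k - 2. A minimal sketch follows; the function name is an illustrative assumption, and the critical value 2.12 is the tabulated $t_{0.025,(16)}$ quoted in the example.

```python
import math


def t_for_partial_correlation(r, n, k):
    """t-statistic and degrees of freedom for H0: a partial correlation of order k is zero."""
    df = n - k - 2                       # sample size reduced by k held-constant variables
    t_value = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t_value, df


t_partial, df_partial = t_for_partial_correlation(r=0.40, n=21, k=3)
f_partial = t_partial ** 2               # equivalent F-statistic with (1, n - k - 2) d.f.
print(f"t = {t_partial:.2f} on {df_partial} d.f., F = {f_partial:.2f}")
print("reject H0" if abs(t_partial) >= 2.12 else "do not reject H0")
```

Carrying more decimals gives t of about 1.75 instead of 1.74; the decision is unchanged.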


21.6.2. Testing the Hypothesis that a Multiple Correlation Coefficient is Zero. To test the hypothesis that a multiple correlation coefficient in the population is zero, we compute the corresponding multiple correlation $R_{y.12 \ldots p}$ from the sample of size n from a (p + 1)-variate normal population and use the statistic

$$F = \frac{R_{y.12 \ldots p}^2 (n-p-1)}{(1 - R_{y.12 \ldots p}^2)\, p},$$

which, if $H_0$ is true, has an F-distribution with $v_1 = p$ and $v_2 = n-p-1$ degrees of freedom. This test is equivalent to testing the hypothesis $H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$ simultaneously. A test of such a hypothesis is also called a test of the overall significance of the estimated multiple regression. The rest of the test procedure is the same.

Example 21.14. The marks in statistics (Y) are expressed as a function of marks in mathematics ($X_1$), economics ($X_2$) and intelligence tests ($X_3$). For a random sample of 50 students, the multiple correlation co-efficient $R_{y.123}$ was found to be 0.582. Test the hypothesis that the multiple correlation co-efficient in the population is zero. Let $\alpha = 0.05$.

(i) We state our null and alternative hypotheses as
$H_0$: The multiple correlation co-efficient in the population is zero, i.e. $H_0: \rho_{y.123} = 0$, and
$H_1: \rho_{y.123} \ne 0$.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic, if $H_0$ is true, is

$$F = \frac{R_{y.123}^2 (n-p-1)}{(1 - R_{y.123}^2)\, p},$$

which has an F-distribution with $v_1 = p$ and $v_2 = n-p-1$ degrees of freedom. Here p = 3.




(iv) Computations. Substituting the values, we get

$$F = \frac{(0.582)^2 (50-3-1)}{[1-(0.582)^2](3)} = \frac{15.58}{1.98} = 7.87,$$

and $v_1 = 3$, $v_2 = 50-3-1 = 46$ degrees of freedom.

(v) The critical region is $F > F_{0.05}(3, 46) = 2.81$.

(vi) Conclusion. Since the computed value of F = 7.87 falls in the critical region, we reject $H_0$ and may conclude that the multiple correlation coefficient in the population differs from zero significantly.
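The F-statistic of Example 21.14 depends only on R, n and p, so it can be wrapped in a small helper. This is a sketch under the same assumptions as the text; the function name is hypothetical and the critical value 2.81 is the tabulated $F_{0.05}(3, 46)$ quoted in the example.

```python
def f_for_multiple_correlation(R, n, p):
    """F-statistic and degrees of freedom for H0: the multiple correlation is zero."""
    r_squared = R * R
    f_value = r_squared * (n - p - 1) / ((1 - r_squared) * p)
    return f_value, p, n - p - 1


F_mult, v1_mult, v2_mult = f_for_multiple_correlation(R=0.582, n=50, p=3)
print(f"F = {F_mult:.2f} with ({v1_mult}, {v2_mult}) degrees of freedom")
print("reject H0" if F_mult > 2.81 else "do not reject H0")
```

Without the intermediate rounding to 15.58 and 1.98 the ratio is about 7.85 rather than 7.87, which does not affect the conclusion.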


21.7 ANALYSIS OF VARIANCE FOR REGRESSION

The procedure of analysing or partitioning the total variation in the dependent variable $Y_i$ into its components, one explained by the regression line and the other the residual or unexplained part, is called analysis of variance for regression. Two independent and unbiased estimates of the population variance are obtained by dividing the two parts by the corresponding number of degrees of freedom. Their ratio, under the assumption of normality of the values of $Y_i$ and the variables in the population having no regression, has an F-distribution with appropriate degrees of freedom. The test procedure is usually arranged in an analysis of variance table.

21.7.1. ANOVA for Simple Linear Regression and Test of $H_0: \beta = 0$. Suppose we are given a random sample of n pairs of observations $(X_i, Y_i)$ from a bivariate normal population, in which the variables are assumed to be unrelated, i.e. $\beta = 0$. Let the estimated linear regression be $\hat{Y}_i = a + bX_i$. If $\bar{Y}$ denotes the mean of the n values of $Y_i$, we can construct the following identity:

$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y}).$$

Squaring both sides and summing over all values, we get

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}).$$

The cross-product term is evaluated as below:

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = \sum_{i=1}^{n} (Y_i - a - bX_i)(a + bX_i - \bar{Y})$$
$$= a\left(\sum Y_i - na - b\sum X_i\right) + b\left(\sum X_iY_i - a\sum X_i - b\sum X_i^2\right) - \bar{Y}\left(\sum Y_i - na - b\sum X_i\right).$$

But $\sum Y_i - na - b\sum X_i = 0$ and $\sum X_iY_i - a\sum X_i - b\sum X_i^2 = 0$, as they are the normal equations when we derive a and b. Thus the cross-product vanishes and we are left with

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2.$$


Hence, we find that the amount of variation in the dependent variable, given by $\sum (Y_i - \bar{Y})^2$ and called the Total Sum of Squares, has been partitioned into two parts: (i) that which has been explained by the regression line, i.e. $\sum (\hat{Y}_i - \bar{Y})^2$, referred to as the Sum of Squares due to regression, and (ii) that which is left unexplained by the regression line, i.e. $\sum (Y_i - \hat{Y}_i)^2$, called the Sum of Squares of residuals or errors. Thus the equation partitioning the total sum of squares may be written as

Total SS (SST) = Residual or Error SS (SSE) + Regression SS (SSR).

Now the number of degrees of freedom corresponding to $\sum (Y_i - \bar{Y})^2$ is n - 1, as there is one restriction of computing $\bar{Y}$ from the data; the number of degrees of freedom corresponding to $\sum (Y_i - \hat{Y}_i)^2$ is n - 2, as its computation is subject to two restrictions (i.e., computation of a and b); and corresponding to $\sum (\hat{Y}_i - \bar{Y})^2$, the d.f. = (n - 1) - (n - 2) = 1.

On the hypothesis that $H_0: \beta = 0$, i.e. the variables are unrelated, we get two independent and unbiased estimates of the variance $\sigma_{Y.X}^2$ when the two parts, viz. the Regression sum of squares and the Error or Residual sum of squares, are divided by the respective degrees of freedom. That is, Mean Square Error (MSE) and Mean Square Regression (MSR) are both estimates of $\sigma_{Y.X}^2$. The expected mean squares can be shown as

$$E(MSE) = \sigma_{Y.X}^2 \quad \text{and} \quad E(MSR) = \sigma_{Y.X}^2 + \beta^2 \sum (X_i - \bar{X})^2.$$

A larger value of MSR suggests that $\beta$ is not zero.
Hence, if $H_0: \beta = 0$, i.e. the population has no regression, is true, the statistic

$$F = \frac{MSR}{MSE}$$

follows the F-distribution with $v_1 = 1$ and $v_2 = n - 2$ d.f. We would reject $H_0: \beta = 0$ when F exceeds $F_{\alpha}(1, n-2)$.


The computation of the various sums of squares can be simplified as below:

Total SS (SST) $= \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$

Residual or Error SS (SSE) $= \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - a - bX_i)^2 = \sum Y_i^2 - a\sum Y_i - b\sum X_iY_i$

Regression SS (SSR) can be found either by subtraction or can be computed directly as

SSR $= \sum (\hat{Y}_i - \bar{Y})^2 = \sum (a + bX_i - \bar{Y})^2 = a\sum Y_i + b\sum X_iY_i - \frac{(\sum Y_i)^2}{n}$
$= \frac{[\sum (X_i - \bar{X})(Y_i - \bar{Y})]^2}{\sum (X_i - \bar{X})^2} = b^2 \sum (X_i - \bar{X})^2.$

These results are usually arranged in an analysis of variance table as follows:

Source of Variation    d.f.     Sum of Squares    Mean Square          Computed F
Regression             1        SSR               MSR = SSR/1          F = MSR/MSE
Residual               n - 2    SSE               MSE = SSE/(n - 2)
Total                  n - 1    SST

It is interesting to note that the above-mentioned SS can be expressed in terms of r, the correlation co-efficient. Thus we have

$$\sum (\hat{Y}_i - \bar{Y})^2 = b^2 \sum (X_i - \bar{X})^2 = nb^2S_x^2 = nr^2S_y^2, \text{ and}$$

$$\sum (Y_i - \hat{Y}_i)^2 = \sum [(Y_i - \bar{Y}) - b(X_i - \bar{X})]^2$$
$$= \sum (Y_i - \bar{Y})^2 + b^2 \sum (X_i - \bar{X})^2 - 2b\sum (X_i - \bar{X})(Y_i - \bar{Y})$$
$$= nS_y^2 + nb^2S_x^2 - 2nbrS_xS_y$$
$$= nS_y^2 + nr^2S_y^2 - 2nr^2S_y^2$$
$$= n(1 - r^2)S_y^2.$$

Now $MSR = nr^2S_y^2$ and $MSE = \frac{n(1-r^2)S_y^2}{n-2}$, so that

$$F = \frac{MSR}{MSE} = \frac{r^2(n-2)}{1-r^2},$$

which, when $H_0: \beta = 0$ is true, follows the F-distribution with $v_1 = 1$ and $v_2 = n - 2$ d.f.
We note that the statistic F is the square of t, i.e. $F = t^2$ when $v_1 = 1$. Hence this method is equivalent to testing the hypothesis $H_0: \rho = 0$.

Example 21.15. Test the hypothesis that $\beta = 0$ at the 0.05 level of significance for the data in Example 21.5 by setting the results in an analysis of variance table.

The stepwise solution is given below:

(i) We state our hypotheses as
$H_0: \beta = 0$, i.e., the two variables are unrelated or there is no regression, and
$H_1: \beta \ne 0$, i.e., the two variables are related.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$F = \frac{\text{Regression Mean Square}}{\text{Residual Mean Square}},$$

which, if $H_0$ is true, has an F-distribution with $v_1 = 1$, $v_2 = n - 2$ degrees of freedom, assuming that the population is normally distributed.

(iv) Computations. To set up the analysis of variance table, we compute the necessary sums of squares as below:

In Example 21.5, we found $\sum Y_i = 1700$; $\sum Y_i^2 = 246,100$; $\sum X_iY_i = 109,380$; n = 12, and the estimated regression line as $\hat{Y} = -179.42 + 5.03X_i$. Now, we find

Total SS $= \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} = 246100 - \frac{(1700)^2}{12} = 5267;$


Residual SS $= \sum (Y_i - \hat{Y}_i)^2 = \sum Y_i^2 - a\sum Y_i - b\sum X_iY_i$
$= 246100 - (-179.42)(1700) - (5.03)(109380)$
$= 246100 + 305014 - 550181 = 933$; and

Regression SS = Total SS - Residual SS
$= 5267 - 933 = 4334.$

The ANOVA-Table is

Source of Variation    d.f.    Sum of Squares    Mean Square    Computed F
Regression             1       4334              4334           46.45
Residual               10      933               93.3           --
Total                  11      5267

(v) The critical region is $F > F_{0.05}(1, 10) = 4.96$.

(vi) Conclusion. Since the computed value of F = 46.45 falls in the critical region, we reject our hypothesis and may conclude that the variables in the population are related.
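The analysis of variance in Example 21.15 can be reproduced from the raw data. The sketch below is illustrative only: it assumes that the data of Example 21.5 are the twelve height and weight pairs listed in Example 21.7 (their sums agree with the totals quoted above), and the function name `regression_anova` is hypothetical. Because a and b are not rounded, the sums of squares differ slightly from the hand-computed 4334 and 933.

```python
def regression_anova(x, y):
    """Sums of squares, mean squares and F for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)    # total SS, n - 1 d.f.

    b = sxy / sxx                               # least-squares slope
    ssr = b * sxy                               # regression SS = b^2 * Sxx, 1 d.f.
    sse = sst - ssr                             # residual SS, n - 2 d.f.
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SST": sst, "MSR": msr, "MSE": mse, "F": msr / mse}


heights = [60, 60, 60, 62, 62, 62, 62, 64, 64, 70, 70, 70]
weights = [110, 135, 120, 120, 140, 130, 135, 150, 145, 170, 185, 160]
table = regression_anova(heights, weights)
for name in ("SSR", "SSE", "SST", "MSR", "MSE", "F"):
    print(f"{name:>3} = {table[name]:10.2f}")

# The same F follows from the correlation coefficient: F = r^2 (n - 2) / (1 - r^2).
r_squared = table["SSR"] / table["SST"]
f_from_r = r_squared * (len(heights) - 2) / (1 - r_squared)
```

The value of F is compared with $F_{0.05}(1, 10) = 4.96$ from the table, as in step (v).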
21.7.2. ANOVA for Multiple Regression and Testing Hypothesis about the $\beta$ Parameters. Let us first consider a case of linear regression on two variables $X_1$ and $X_2$. Suppose we have a random sample of n sets of observations $[(Y_i, X_{1i}, X_{2i}), i = 1, 2, \ldots, n]$ from a trivariate normal population having $\beta_1 = \beta_2 = 0$, where the $\beta$'s are the regression coefficients.

Let the estimated regression line be $\hat{Y} = a + b_1X_1 + b_2X_2$. Then the following sum-of-squares identity is obtained:

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

The total SS has, as before, n - 1 degrees of freedom, while the residual SS has n minus the number of parameters estimated by the regression equation (in this case 3). The results are arranged in the following ANOVA-table:

Source of Variation    d.f.     Sum of Squares    Mean Square          Computed F
Regression             2        SSR               MSR = SSR/2          F = MSR/MSE
Residual               n - 3    SSE               MSE = SSE/(n - 3)
Total                  n - 1    SST


When the null hypothesis $H_0: \beta_1 = \beta_2 = 0$ is true, MSR and MSE are found to be independent and unbiased estimates of the variance $\sigma^2$. Hence the variance ratio F has an F-distribution with $\nu_1 = 2$, $\nu_2 = n-3$ degrees of freedom, and we reject the null hypothesis when F exceeds $F_\alpha(\nu_1, \nu_2)$. This test is frequently called a test for the overall significance of the estimated multiple regression, since it simultaneously tests the hypothesis that each $\beta_j = 0$ $(j > 0)$ against

$H_1$: At least one of the $\beta_j \ne 0$.


The sums of squares can be computed by the following formulas:

$SST = \sum (Y_i - \bar Y)^2 = \sum Y_i^2 - (\sum Y_i)^2/n;$
$SSE = \sum (Y_i - \hat Y_i)^2 = \sum Y_i^2 - a\sum Y_i - b_1 \sum X_{1i} Y_i - b_2 \sum X_{2i} Y_i;$ and
$SSR = SST - SSE.$
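As a hedged illustration of these computing formulas, the sketch below solves the normal equations for two regressors and builds the three sums of squares. The sums used are those quoted in Example 21.16 of this chapter; `solve_linear` is a hypothetical helper written for this sketch (plain Gauss-Jordan elimination), not a routine from the text.

```python
# Sketch: least-squares fit and ANOVA for Y on X1 and X2 from summary sums.
# solve_linear is a hypothetical helper introduced only for this example.

def solve_linear(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    size = len(rhs)
    aug = [list(map(float, row)) + [float(value)]
           for row, value in zip(matrix, rhs)]
    for col in range(size):
        # Bring the largest remaining pivot into position.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


n = 7
sum_x1, sum_x2, sum_y = 23, 40, 67
sum_x1_sq, sum_x2_sq, sum_x1x2 = 105, 294, 173
sum_y_sq, sum_x1y, sum_x2y = 815, 290, 488

# Normal equations for Yhat = a + b1*X1 + b2*X2.
a, b1, b2 = solve_linear(
    [[n, sum_x1, sum_x2],
     [sum_x1, sum_x1_sq, sum_x1x2],
     [sum_x2, sum_x1x2, sum_x2_sq]],
    [sum_y, sum_x1y, sum_x2y])

sst = sum_y_sq - sum_y ** 2 / n
sse = sum_y_sq - a * sum_y - b1 * sum_x1y - b2 * sum_x2y
ssr = sst - sse
f_ratio = (ssr / 2) / (sse / (n - 3))   # 2 and n - 3 degrees of freedom

print(round(a, 4), round(b1, 4), round(b2, 4))
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(f_ratio, 2))
```

The F-ratio obtained this way is compared with the tabulated value $F_\alpha(2, n-3)$; the sketch does not compute the critical value itself.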

We may generalise this discussion by considering a linear regression on p variables $X_1, X_2, \ldots, X_p$. The random sample in this case will consist of n sets of values $[(Y_i, X_{1i}, X_{2i}, \ldots, X_{pi}),\ i = 1, 2, \ldots, n]$. The (p+1)-variate population is again assumed to be uncorrelated normal with variance $\sigma^2$ and we wish to test the hypothesis that the population regression co-efficients equal zero, i.e., $H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$. Let the estimated regression equation be

$\hat Y = a + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p.$

Then the sum-of-squares identity will be

$$\sum_{i=1}^{n} (Y_i - \bar Y)^2 = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2 + \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$$

Here the Residual SS will have n−p−1 degrees of freedom as p+1 parameters are to be estimated, and the Regression SS term has p degrees of freedom. The analysis of variance table would be as follows:

| Source of Variation | d.f. | Sum of Squares | Mean Square | Computed F |
| Regression | $p$ | $SSR = \sum (\hat Y_i - \bar Y)^2$ | $MSR = SSR/p$ | $F = MSR/MSE$ |
| Residual | $n-p-1$ | $SSE = \sum (Y_i - \hat Y_i)^2$ | $MSE = SSE/(n-p-1)$ | |
| Total | $n-1$ | $SST = \sum (Y_i - \bar Y)^2$ | | |


It has been shown that the two Mean Squares provide independent and unbiased estimates of $\sigma^2$, when the null hypothesis $H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$ is true. Hence the statistic F has an F-distribution with $\nu_1 = p$, $\nu_2 = n-p-1$ degrees of freedom and we reject $H_0$ if F exceeds $F_\alpha(p, n-p-1)$. Rejection of $H_0$ implies that at least one of the independent variables $X_1, X_2, \ldots, X_p$ contributes significantly to the regression model.

The simplification of the various sum-of-squares terms described above gives interesting results. If R denotes the multiple correlation co-efficient (the order of the co-efficient corresponds to the number of independent variables used in the regression equation), then

$\text{Regression SS} = \sum (\hat Y_i - \bar Y)^2 = R^2 \sum (Y_i - \bar Y)^2$, and
$\text{Residual SS} = \sum (Y_i - \hat Y_i)^2 = (1 - R^2) \sum (Y_i - \bar Y)^2.$

Therefore, the statistic F, the ratio of two unbiased estimates of $\sigma^2$, will have an F-distribution with $\nu_1 = p$, $\nu_2 = n-p-1$ degrees of freedom and will be used to test the null hypothesis that the multiple correlation co-efficient in the (p+1)-variate population is zero. The results can be assembled in an ANOVA-table as below:

| Source of Variation | d.f. | Sum of Squares | Mean Square |
| Regression | $p$ | $SSR = R^2 \sum (Y_i - \bar Y)^2$ | $MSR = R^2 \sum (Y_i - \bar Y)^2 / p$ |
| Residual | $n-p-1$ | $SSE = (1-R^2) \sum (Y_i - \bar Y)^2$ | $MSE = (1-R^2) \sum (Y_i - \bar Y)^2 / (n-p-1)$ |
| Total | $n-1$ | $SST = \sum (Y_i - \bar Y)^2$ | |

$$\text{Computed } F = \frac{MSR}{MSE} = \frac{R^2 (n-p-1)}{(1 - R^2)\, p}$$

Hence, in this case, we observe that the statistic F depends only on R, the multiple correlation co-efficient, and the degrees of freedom.
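A minimal sketch of this relation follows. The function name and the numerical values (R² = 0.75, n = 23, p = 2, total SS = 100) are assumptions chosen only for illustration; the point is that the F-ratio built from the mean squares coincides with the one built from R² alone.

```python
# Sketch: the overall F-ratio expressed through R^2.
# All numerical inputs below are hypothetical, chosen for easy arithmetic.

def f_from_r_squared(r_squared, n, p):
    """F = R^2 (n - p - 1) / ((1 - R^2) p), with p and n - p - 1 d.f."""
    return r_squared * (n - p - 1) / ((1.0 - r_squared) * p)


r_squared, n, p = 0.75, 23, 2
f_value = f_from_r_squared(r_squared, n, p)

# The same ratio from the mean squares, for an arbitrary total SS.
total_ss = 100.0
msr = r_squared * total_ss / p
mse = (1.0 - r_squared) * total_ss / (n - p - 1)
f_from_mean_squares = msr / mse

print(f_value, f_from_mean_squares)
```

The total sum of squares cancels in the ratio, which is why only R and the degrees of freedom matter.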


Example 21.16. Consider the following set of data:


6 5 
1 2 
4 3 


(a) Find the estimated regression equation $\hat Y = a + b_1 X_1 + b_2 X_2$.

(b) Obtain the ANOVA table and test the hypothesis that there is no association between either regressor and the dependent variable. Use the 0.01 level of significance.
(I.U., M.Sc. 1995)
(a) The equation of the estimated multiple linear regression is

$\hat Y = a + b_1 X_1 + b_2 X_2$, where $a$, $b_1$ and $b_2$ are the least-squares estimates of $\alpha$, $\beta_1$ and $\beta_2$ respectively.

The sums needed to calculate $a$, $b_1$ and $b_2$ are found to be

$\sum X_1 = 23$, $\sum X_2 = 40$, $\sum X_1^2 = 105$, $\sum X_2^2 = 294$, $\sum X_1 X_2 = 173$,
$\sum Y = 67$, $\sum Y^2 = 815$, $\sum Y X_1 = 290$, $\sum Y X_2 = 488$ and $n = 7$.

Now substituting these sums in the Normal Equations, we get

$7a + 23b_1 + 40b_2 = 67$
$23a + 105b_1 + 173b_2 = 290$
$40a + 173b_1 + 294b_2 = 488.$

Solving them simultaneously, we obtain
$a = 0.7379$, $b_1 = 1.0123$ and $b_2 = 0.9638$.

Hence the desired estimated multiple regression is
$\hat Y = 0.7379 + 1.0123 X_1 + 0.9638 X_2.$

(b) To obtain the ANOVA table and to test the hypothesis $H_0: \beta_1 = \beta_2 = 0$, we proceed as below:

(i) We state the null and alternative hypotheses as
$H_0: \beta_1 = \beta_2 = 0$, i.e. none of the regressors is significant; and
$H_1$: At least one of $\beta_1$ and $\beta_2$ is non-zero.

(ii) The significance level is set at $\alpha = 0.01$.

(iii) The test-statistic to use is
$F = \dfrac{MSR}{MSE}$
which, if $H_0$ is true, has an F-distribution with $\nu_1 = 2$ and $\nu_2 = n-3$ d.f., assuming that the population is normally distributed.

(iv) Computations. To set up the ANOVA table, we find the necessary sums of squares as below:

$\text{Total SS} = \sum (Y_i - \bar Y)^2 = \sum Y_i^2 - \dfrac{(\sum Y_i)^2}{n} = 815 - \dfrac{(67)^2}{7} = 173.7143$

$\text{Regression SS} = \sum (\hat Y_i - \bar Y)^2 = a\sum Y + b_1 \sum X_1 Y + b_2 \sum X_2 Y - (\sum Y)^2/n$
$= (0.7379)(67) + (1.0123)(290) + (0.9638)(488) - (67)^2/7 = 172.0532$

$\text{Residual SS} = \text{Total SS} - \text{Regression SS} = 173.7143 - 172.0532 = 1.6611$

The ANOVA-table is

| Source of Variation | d.f. | Sum of Squares | Mean Square | F |
| Regression | 2 | 172.0532 | 86.0266 | 207.14 |
| Error | 4 | 1.6611 | 0.4153 | |
| Total | 6 | 173.7143 | | |

(v) The critical region is $F > F_{0.01}(2, 4) = 18.00$.

(vi) Conclusion. Since the calculated value of F = 207.14 falls in the critical region, we reject our null hypothesis and conclude that there is an association between at least one of the regressors and the dependent variable.

EXERCISES

21.1 In a regression problem, find the mean and variance of
(i) b, the estimate of $\beta$,
(ii) a, the estimate of $\alpha$, and
(iii) $\hat Y$, the estimate of the mean value $\mu_{Y.X}$ for a given X value.

21.2 (a) Show that the least squares estimates a and b are unbiased and also find their variances.
(b) What are the properties of the sampling distribution of b, the estimate of $\beta$? How is a confidence interval for $\beta$ constructed?

21.3 Describe how to construct confidence interval for the regression co-efficient, illustrating your answer by considering regression of Y on X from the following data:
n = 20, $\sum X = 40$, $\sum Y = 60$, $\sum X^2 = 95$, $\sum Y^2 = 297$, $\sum XY = 150$;
mention what assumptions you have made in your conclusion.

21.4 The heights (X) and weights (Y) of 100 individuals give
$\sum X_i = 6826$, $\sum Y_i = 16440$, $\sum X_i^2 = 466540$, $\sum Y_i^2 = 2766596$, $\sum X_i Y_i = 1124828$.
(i) Determine the slope for the regression line $Y_i = \alpha + \beta X_i + \varepsilon_i$, where notations have usual meanings.
(ii) Assuming normality, find a 95% confidence interval for $\alpha$ and $\beta$, the regression parameters. (P.U., M.Sc. 1981)

21.5 The age (X) and systolic blood pressure (Y) of 100 individuals gave the following sums:
$\sum X_i = 4421$, $\sum Y_i = 12130$, $\sum X_i^2 = 208849$, $\sum Y_i^2 = 1498976$, $\sum X_i Y_i = 542735$.
(a) Compute the regression line which is used to estimate $\mu_{Y.X}$.
(b) Assuming normality,
(i) construct 95% confidence intervals for $\alpha$ and $\beta$,
(ii) predict … (…, M.Sc. 1984)

21.6 From 10 pairs of values $(X_i, Y_i)$, i = 1, 2, …, 10, the following quantities are calculated:
$\sum X = 311$, $\sum Y = 310.1$, $\sum XY = 10{,}074$, $\sum X^2 = 10100$ and $\sum Y^2 = 10{,}055.99$.
Assuming normality,
(i) find 90% confidence intervals for $\beta$ and $\mu_{Y.X=30}$,
(ii) test the hypothesis that $\beta = 1$ against the alternative that $\beta \ne 1$. (I.U., M.Sc. 1990)

21.7 Fit a regression line $\hat Y = a + bX$ to the following data and test the hypothesis $H_0: \beta = 0$ at 5% level, where $\beta$ is the population regression coefficient:
n = 12, $\sum X = 766$, $\sum Y = 1700$, $\sum X^2 = 49068$, $\sum Y^2 = 246100$ and $\sum XY = 109380$. (P.U., B.A./B.Sc. 1987)

21.8 Given n = 20, $\sum X = 40$, $\sum Y = 60$, $\sum X^2 = 95$, $\sum Y^2 = 3\ldots0$, $\sum XY = 150$.
Determine the regression of Y on X and test the hypothesis that the two variables are independent. (P.U., B.A./B.Sc. 1993)


21.9 The data given below represent the heights (X) and the weights (Y) of five men. We selected the heights in advance and then observed the weights of a random group of men having the selected heights.

X: 60 62 65 70 72
Y: 130 135 158 170 185

Using 5 per cent significance level and assuming normality,
(i) test the hypothesis $H_0: \beta = 0$ against $\beta \ne 0$,
(ii) test the hypothesis $H_0: \beta = 6$ against $\beta \ne 6$,
(iii) predict the weight of an individual who is 66 inches in height. Give a prediction interval. (P.U., M.Sc. 1979)

21.10 Find the linear regression equation from the following data:

Y: 85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74

Assuming normality, test the hypothesis
(i) $H_0: \beta = 0$ against $H_1: \beta \ne 0$;
(ii) $H_0: \alpha = 32$ against $H_1: \alpha \ne 32$;
at the 0.01 level of significance.

21.11 (a) Describe how to construct confidence interval for the difference between two linear regression co-efficients of two regression lines.
(b) The various sums for two sets of data, each of 4 observations, are as follows:
(i) Find the estimates of $\beta_1$ and $\beta_2$, the regression coefficients of two linear regression lines.
(ii) Construct the 95% confidence interval for $\beta_1 - \beta_2$ and test the hypothesis that $\beta_1 = \beta_2$ against $H_1: \beta_1 \ne \beta_2$ at the … level of significance.
(M.Sc., I.U., 1990, P.U., 199…)

21.12 Various sums for three sets of data, each of four observations, are as follows:
(i) Determine the estimates of $\beta_1$, $\beta_2$ and $\beta_3$, the regression coefficients of three linear regression lines.
(ii) Test the hypotheses $\beta_1 = \beta_2$, $\beta_2 = \beta_3$ and $\beta_1 = \beta_3$ and state whether the lines can be regarded as parallel.
(P.U., M.Sc. 1987)

21.13 Use the following data to test the hypothesis that the regression is linear at the 0.05 level of significance:

X: 2 2 2 3 3 4 5 5 6 6 6
Y: 4 3 8 18 22 24 24 18 13 10 16

21.14 A survey of the pocket money received by children in a primary school was made by choosing at random four children of each of the ages 5, 7, 9 and 11 years. The amounts of pocket money received are given below:

| Age (years) | Pocket money (Rs.) |
| 5 | 2, 8, 10, 12 |
| 7 | 9, 13, 14, 16 |
| 9 | 9, 14, 16, 21 |
| 11 | 18, 19, 23, 36 |

Find the regression equation for predicting pocket money from age and test for linearity of regression. Use a 0.05 level of significance.

21.15 The amounts of a chemical compound Y, which … grams of water at various temperatures, X, were recorded as follows:

Find the regression equation and test the hypothesis that the regression is linear at the 0.05 level of significance.

21.16 (a) How will you find a confidence interval for population correlation coefficient?
(b) A correlation co-efficient of 0.2 is obtained from a random sample of 625 pairs of observations. Find the 99% confidence interval for the coefficient of correlation in the population.

21.17 (a) A random sample of size n = 23, taken from a bivariate normal population, showed a correlation coefficient of …. Compute the 90% confidence interval for $\rho$. (P.U., B.A./B.Sc. 1939)
(b) The lengths (X) and breadths (Y) of 243 cuckoo eggs were measured (in mm) with the results:
$\sum X = 5442.2$, $\sum X^2 = 122155.04$, $\sum Y = 4019.6$, $\sum Y^2 = 66588.92$ and $\sum XY = 90113.83$.
Give a 95% confidence interval for the population correlation coefficient.

21.18 (a) In a class of 25 students we find a correlation co-efficient of 0.731 between the scores on two tests. Find the … confidence interval for the value of the correlation coefficient in the population. (I.U., M.Sc. 1993)
(b) A random sample of size 20 from a bivariate normal population showed a correlation coefficient of 0.92. Find … confidence interval for the population correlation coefficient. (I.U., M.Sc. 199…)

21.19 (a) Test the hypothesis that $\rho = 0.7$, if a sample of 50 gave r = …. (P.U., B.A./B.Sc. 1983)
(b) A value of r of 0.6 is calculated from a random sample of 39 pairs of observations from a bivariate normal population. Is this value of r consistent with the hypothesis that $\rho = 0.4$?

21.20 (a) Two independent samples have 28 and 19 pairs of observations with correlations … respectively. Are these values consistent with the hypothesis that the samples have been drawn from the same population?
(b) A sample of 67 pairs gave a correlation of 0.72 whereas another sample of 39 pairs gave a correlation of 0.84. Test at the 0.05 level of significance the hypothesis $H_0: \rho_1 = \rho_2$ against the alternative $H_1: \rho_1 \ne \rho_2$.

21.21 (a) Explain the procedure for testing a hypothesis that the population correlation co-efficient equals zero.
(b) A random sample of 28 observations gave a correlation co-efficient of 0.45. Test the hypothesis at the 0.05 level of significance that the population correlation co-efficient is zero.

21.22 (a) A sample of 10 pairs of observations yields a correlation coefficient of 0.7. Is it reasonable to suppose that such a value would arise from a population where the coefficient is 0.85?
(b) Is the value of 0.7 itself significant?
(c) Another sample of 12 pairs of observations shows a coefficient of 0.9. Is this likely to be from the same population as the first?

21.23 (a) A random sample of 20 pairs of observations from a bivariate normal population gives a correlation coefficient of 0.55. Using $\alpha = 0.05$, test the hypothesis that the variables in the population are uncorrelated. (P.U., B.A./B.Sc. 1989)
(b) A random sample of 27 pairs of observations from a bivariate normal population gave a coefficient of correlation of 0.30. Is it consistent with the hypothesis that the coefficient of correlation in the population is zero? Use 5% level of significance. (P.U., B.A./B.Sc. 1996)

21.24 (a) A sample of size 12 yielded r = 0.32. Test $H_0: \rho = 0$ against $H_1: \rho \ne 0$. Let $\alpha = 0.01$.
(b) A random sample of 19 pairs from a bivariate normal population showed a correlation of 0.65. Is this consistent with the hypothesis that correlation coefficient in the population is zero?
(c) A study was made to determine whether the … to the holder of a share of stock is positively correlated with the price of a share of stock. A sample of n = 20 stocks gave a correlation coefficient of r = 0.78. Use the F-test to test the hypothesis of no correlation. Let $\alpha = 0.05$.

21.25 (a) What is the least value of r in a random sample of 38 pairs that is significant (i) at the 0.05 level, (ii) at the 0.01 level? (P.U., B.A./B.Sc. 1977)
(b) Find the least value of r in a sample of 27 pairs from a bivariate normal population that is significant at the 5% level.


(c) How many pairs of observations must be included in a sample in order that an observed correlation co-efficient of value 0.42 shall have a calculated value of t greater than 2.72?

21.26 (a) Describe how to test a hypothesis about the equality of several correlations.
(b) Random samples of 10, 15 and 20 are drawn from a bivariate normal population, yielding r = 0.3, 0.4, 0.49 respectively. Form a combined estimate of $\rho$ and test the hypothesis that the correlations are homogeneous.

21.27 Samples of 20, 30, 40 and 50 are drawn from the same parent population, yielding values of r, the sample correlation coefficient, of 0.41, 0.60, 0.51, 0.48 respectively. Use these values of r to obtain a combined estimate of the population correlation coefficient. Test the hypothesis that the samples come from the same bivariate population.

21.28 (a) Show that a partial correlation coefficient $r_{12.34} = 0.5$, in a sample of 20 sets of values from a quadrivariate normal population, is significant at the 5% level. (I.U., M.Sc. 1990)
(b) From a random sample of 25 sets of values from a normal population, the calculated value of a partial correlation of order two is 0.48. Is this consistent with the hypothesis that the corresponding partial correlation in the population is zero, at 5% level of significance? (P.U., B.A./B.Sc. 1988)

21.29 (a) Given n = 20, $r_{12.34} = 0.51$, test by means of t-test and of F-test, the hypothesis that $\rho_{12.34} = 0$.
(b) From samples of 24, 29, 33, 38, 42, the partial correlation coefficients of order 3 are found to be 0.38, 0.54, 0.60, 0.50, 0.42. Test their homogeneity. (I.U., M.Sc. 1992)

21.30 (a) Given that
determine … and $R_{1.23}$.
(b) Test each one of these correlation coefficients for significance at 5% level. (M.Sc., P.U., 1988, I.U., 1995)
21.31 (a) In a sample of 20 sets of values from a trivariate normal population, $R_{1.23}$ was found to be 0.35. Show that this is not significant of correlation in the population between $X_1$ and the variables $X_2$ and $X_3$. (P.U., B.A./B.Sc., 1990)
(b) In a sample of 25 sets of values from a quadrivariate normal population, $R_{1.234}$ was found to be 0.4. Test the hypothesis at the 0.05 level of significance that the multiple correlation coefficient in the population between $X_1$ and the variables $X_2$, $X_3$ and $X_4$ is zero.

21.32 (a) Outline the procedure for performing a simple linear regression analysis. Set your results in an analysis of variance table and interpret them.
(b) A gauge is to be calibrated using dead weight. If X represents the standard and Y, the gauge reading, perform a linear regression analysis based on the following results from 10 observations.
$\bar X = 230$, $\bar Y = 226$, $\sum (X - \bar X)(Y - \bar Y) = 1532$,
$\sum (X - \bar X)^2 = 1561$, $\sum (Y - \bar Y)^2 = 1539$.
Test $H_0: \beta = 1$, using $\alpha = 0.01$. (P.U., M.Sc. 1971)

21.33 (a) Write a note on Analysis of Variance for Regression.
(b) The data given below represent the heights (X) and the weights (Y) of five men. We selected the heights in advance and then observed the weights of a random group of men having the selected heights.

X: 60 62 65 70 72
Y: 130 135 158 170 185

Set out the analysis of variance for testing the regression.

21.34 Given that n = 38, $\bar X = 6$, $\bar Y = 42$, $\sum (X - \bar X)^2 = 100$, $\sum (Y - \bar Y)^2 = 10{,}000$, $\sum (X - \bar X)(Y - \bar Y) = -800$; answer the following:
(i) Determine $\hat Y = b_0 + b_1 X$.
(ii) Partition $\sum (Y - \bar Y)^2$ into two parts, one associated with the slope of the linear regression and the other associated with the deviations about the regression.
(iii) Test $H_0: \beta_1 = 0$, using $\alpha = 0.05$.
(iv) For the observation (X = 8, Y = 36), compute the adjusted value of Y.
(v) Interpret both $b_1$ and $\beta_1$.
(I.U., M.Sc. 1987)


21.35 Given the data:

Test the hypothesis $H_0: \beta_1 = \beta_2 = 0$ at $\alpha = 0.05$ by applying the analysis of variance technique.

21.36 Given the data:

Obtain the least-squares estimates of the parameters in the multiple regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ and test the overall significance of the regression coefficients.
(I.U., M.Sc. 1991)


22


The Analysis of Covariance


22.1 INTRODUCTION


Occasionally, when an experiment is being carried out, there is an uncontrollable variable, in addition to the main variable (variable of interest) in terms of which we wish to study the effects of classification or treatments. The uncontrollable variable that runs along with the main variable and is suspected to influence its values, is called a concomitant variable (or covariate), and is denoted by X, while the main variable is represented by Y. The data thus consist of pairs of values of the two variables (X, Y), one pair for each unit. The statistical technique applied to analyse the effect of classification or treatment on the main variable Y, after removing the effect of the concomitant variable by regression method, is called Analysis of Covariance. Hence, the analysis of covariance is a mixture of the regression analysis and analysis of variance. Like analysis of variance, this technique partitions the total covariation of a bivariate sample, measured by the sum of the products of the deviations of the variables from their means, into component parts associated with specific sources of variation. The covariance effects are thus sorted out and a test of the existence of regression is made. If evidence of regression is found, its effect is removed before tests are made on the significance of the treatments. R.A. Fisher (1890-1962) observed that the analysis of covariance "combines the advantages and reconciles the requirements of the two widely applicable procedures known as regression and analysis of variance".

To illustrate how the analysis of variance is adjusted to provide for the removal of the effects of concomitant variable X, let us consider a simple linear regression model $Y = \alpha + \beta X + \varepsilon$. Then the estimated regression line is $\hat Y = a + bX$, where a and b are the least squares estimates of $\alpha$ and $\beta$. As the regression line goes through the means, we have $\bar Y = a + b\bar X$. Subtracting, we get

$\hat Y - \bar Y = b(X - \bar X)$
or $\hat y = bx$,

where $x = X - \bar X$, $y = Y - \bar Y$ and $\hat y = \hat Y - \bar Y$ denote deviations from the means. This implies that

$e = Y - \hat Y = y - \hat y = y - bx.$

Now the sum of the squares of the errors of estimate is

$$\sum e^2 = \sum (y - bx)^2 = \sum y^2 - 2b \sum xy + b^2 \sum x^2$$
$$= \sum y^2 - 2\left(\frac{\sum xy}{\sum x^2}\right) \sum xy + \left(\frac{\sum xy}{\sum x^2}\right)^2 \sum x^2 \qquad \left(\because\ b = \frac{\sum xy}{\sum x^2}\right)$$
$$= \sum y^2 - \frac{2(\sum xy)^2}{\sum x^2} + \frac{(\sum xy)^2}{\sum x^2} = \sum y^2 - \frac{(\sum xy)^2}{\sum x^2}$$

The term $(\sum xy)^2 / \sum x^2$ is actually the sum of squares due to the regression. Hence, to remove the effect of the regression, the term $(\sum xy)^2 / \sum x^2$ is to be subtracted from the sum of squares for the variable Y and the result is the corrected or adjusted sum of squares for Y.
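The identity can be checked numerically. The sketch below uses a small set of hypothetical (X, Y) values, invented only for illustration, and compares the residual sum of squares computed directly from the fitted line with the adjusted sum of squares $\sum y^2 - (\sum xy)^2/\sum x^2$.

```python
# Sketch: adjusted sum of squares for Y after removing the regression on X.
# The data are hypothetical and serve only to illustrate the identity.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sum_x_sq = sum((x - x_bar) ** 2 for x in xs)                     # sum of x^2
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # sum of xy
sum_y_sq = sum((y - y_bar) ** 2 for y in ys)                     # sum of y^2

b = sum_xy / sum_x_sq        # least-squares slope
a = y_bar - b * x_bar        # the line passes through the means

# Residual SS computed directly from the fitted values ...
direct = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# ... and through the adjusted sum-of-squares formula.
adjusted = sum_y_sq - sum_xy ** 2 / sum_x_sq

print(direct, adjusted)
```

Both routes give the same value (up to floating-point rounding), which is why the analysis of covariance can work entirely with sums of squares and products.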


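22.2 ONE-WAY ANALYSIS OF COVARIANCE AND PARTITIONING THE SUM OF PRODUCTS

Suppose that a random sample of n pairs of observations on two variables X and Y is taken, and that the paired observations $(X_{ij}, Y_{ij})$ are classified into k Groups or Treatments, the jth group containing $n_j$ pairs.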
As before, let $\bar X_{.j}$ and $\bar Y_{.j}$ denote the means of the two variables in the jth group or treatment and $\bar X_{..}$ and $\bar Y_{..}$ be their grand means.

We construct the following identities:

$X_{ij} - \bar X_{..} = (X_{ij} - \bar X_{.j}) + (\bar X_{.j} - \bar X_{..}),$
$Y_{ij} - \bar Y_{..} = (Y_{ij} - \bar Y_{.j}) + (\bar Y_{.j} - \bar Y_{..}).$

We have already shown that

$$\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{..})^2 = \sum_j \sum_i (X_{ij} - \bar X_{.j})^2 + \sum_j n_j (\bar X_{.j} - \bar X_{..})^2$$

Similarly,

$$\sum_j \sum_i (Y_{ij} - \bar Y_{..})^2 = \sum_j \sum_i (Y_{ij} - \bar Y_{.j})^2 + \sum_j n_j (\bar Y_{.j} - \bar Y_{..})^2$$

The identity for sum of products is

$$(X_{ij} - \bar X_{..})(Y_{ij} - \bar Y_{..}) = (X_{ij} - \bar X_{.j})(Y_{ij} - \bar Y_{.j}) + (\bar X_{.j} - \bar X_{..})(\bar Y_{.j} - \bar Y_{..})$$
$$+ (X_{ij} - \bar X_{.j})(\bar Y_{.j} - \bar Y_{..}) + (\bar X_{.j} - \bar X_{..})(Y_{ij} - \bar Y_{.j})$$

Summing over all pairs of observations, we get

$$\sum_j \sum_i (X_{ij} - \bar X_{..})(Y_{ij} - \bar Y_{..}) = \sum_j \sum_i (X_{ij} - \bar X_{.j})(Y_{ij} - \bar Y_{.j}) + \sum_j n_j (\bar X_{.j} - \bar X_{..})(\bar Y_{.j} - \bar Y_{..})$$

The other two terms drop out as they are found to be zero when summed over i and j. Thus the analysis of the sum of products, when paired observations are classified into k groups, is analogous to the analysis of the sum of squares of X or of Y.

Hence in problems involving k groups of paired observations of X and Y, there will be three analyses, each with n−1 degrees of freedom, viz.:

(i) analysis of sum of squares of X;
(ii) analysis of sum of squares of Y; and
(iii) analysis of sum of products of X and Y.

It is customary to denote the Total SS, the Between or Treatment SS and the Within or Error SS by

$S_{xx} = T_{xx} + E_{xx}$ and
$S_{yy} = T_{yy} + E_{yy}$

for X and Y respectively. Similarly, we have

$S_{xy} = T_{xy} + E_{xy}$ for products.
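The partition $S_{xy} = T_{xy} + E_{xy}$ can be verified on any grouped data. The following sketch uses hypothetical groups of unequal size; the group labels, the data and the helper `mean` are assumptions made for the illustration.

```python
# Sketch: partitioning the total sum of products into Between (T_xy)
# and Within (E_xy) components. Data and names are hypothetical.

groups = {
    "A": [(3, 10), (2, 8), (1, 9)],
    "B": [(4, 12), (5, 11), (3, 13), (4, 10)],
    "C": [(2, 7), (1, 6)],
}


def mean(values):
    return sum(values) / len(values)


all_pairs = [pair for pairs in groups.values() for pair in pairs]
grand_x = mean([x for x, _ in all_pairs])
grand_y = mean([y for _, y in all_pairs])

# Total sum of products about the grand means.
s_xy = sum((x - grand_x) * (y - grand_y) for x, y in all_pairs)

e_xy = 0.0  # Within (Error) sum of products
t_xy = 0.0  # Between (Treatment) sum of products
for pairs in groups.values():
    group_x = mean([x for x, _ in pairs])
    group_y = mean([y for _, y in pairs])
    e_xy += sum((x - group_x) * (y - group_y) for x, y in pairs)
    t_xy += len(pairs) * (group_x - grand_x) * (group_y - grand_y)

print(s_xy, t_xy + e_xy)
```

The same loop applied with (x, x) or (y, y) in place of (x, y) reproduces the familiar partitions of the sums of squares.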
Assuming that the factor being investigated has no effect and that the population from which the random sample of n pairs of observations is taken has the covariance $\mu_{11}$, we find that the expected value of the total sum of products, i.e., $S_{xy}$, is $(n-1)\mu_{11}$. As the observations in the jth group may be regarded as a random sample of $n_j$ pairs, the expected value of the sum of products in the jth group is given by

$$E\Big[\sum_i (X_{ij} - \bar X_{.j})(Y_{ij} - \bar Y_{.j})\Big] = (n_j - 1)\mu_{11}, \text{ so that}$$

$$E\Big[\sum_j \sum_i (X_{ij} - \bar X_{.j})(Y_{ij} - \bar Y_{.j})\Big] = \sum_j (n_j - 1)\mu_{11} = (n-k)\mu_{11}$$

We also find that

$$E\Big[\sum_j n_j (\bar X_{.j} - \bar X_{..})(\bar Y_{.j} - \bar Y_{..})\Big] = (k-1)\mu_{11}$$

Hence each of the terms $S_{xy}$, $E_{xy}$ and $T_{xy}$, when divided by n−1, n−k and k−1 degrees of freedom respectively, gives an unbiased estimate of the population covariance if our assumption is correct. These results can be summarised in the following table:


| Source of Variation | d.f. | $\sum x^2$ | $\sum xy$ | $\sum y^2$ | Co-efficient of Regression |
| Treatment (Between groups) | $k-1$ | $T_{xx}$ | $T_{xy}$ | $T_{yy}$ | |
| Error (Within) | $n-k$ | $E_{xx}$ | $E_{xy}$ | $E_{yy}$ | $E_{xy}/E_{xx}$ |
| Total | $n-1$ | $S_{xx}$ | $S_{xy}$ | $S_{yy}$ | $S_{xy}/S_{xx}$ |

(The three middle columns are the Sums of Squares and Products.)

We may calculate an estimate of the regression co-efficient, a regression line, and partition the sum of squares due to Error and Total. For example, the error sum of squares is partitioned into a sum of squares due to regression of $E_{xy}^2/E_{xx}$ with 1 degree of freedom and a sum of squares due to deviations about the regression line of $E_{yy} - E_{xy}^2/E_{xx}$ with (n−k−1) degrees of freedom. The corresponding sums of squares for the Total SS term are $S_{xy}^2/S_{xx}$ with 1 degree of freedom and $S_{yy} - S_{xy}^2/S_{xx}$ with (n−2) degrees of freedom.

The significance of the regression is tested first. For this purpose, we compare the Regression SS, i.e., $E_{xy}^2/E_{xx}$ having 1 degree of freedom with the term $E_{yy} - E_{xy}^2/E_{xx}$ having n−k−1 degrees of freedom by applying the F-test. If the regression proves to be significant, we construct another table of sums of squares which have been corrected for the effect of regression, in order to test the significance of the difference between group or treatment means. Since the adjusted sum of squares for the Between group or treatment means is obtained by subtraction, we arrange the work in a tabular form as below:

| Source of Variation | Regression SS | Adjusted Sum of Squares | d.f. | Mean Square |
| Total | $S_{xy}^2/S_{xx}$ | $S'_{yy} = S_{yy} - S_{xy}^2/S_{xx}$ | $n-2$ | |
| Error | $E_{xy}^2/E_{xx}$ | $E'_{yy} = E_{yy} - E_{xy}^2/E_{xx}$ | $n-k-1$ | $s_e^2 = E'_{yy}/(n-k-1)$ |
| Treatment | | $T'_{yy} = S'_{yy} - E'_{yy}$ (by subtraction) | $k-1$ | $s_t^2 = T'_{yy}/(k-1)$ |

The quantity $s_e^2$ is an estimate of the variance of Y in the population after correction for regression. Similarly, $s_t^2$ is another estimate of the variance of Y. Hence the statistic $F = s_t^2/s_e^2$ has an F-distribution with $\nu_1 = k-1$, $\nu_2 = n-k-1$ degrees of freedom. We reject the hypothesis that there is no significant difference between the treatment or group means after adjusting for the effect of regression, if the computed value of F exceeds $F_\alpha(k-1, n-k-1)$. To interpret the results properly, a table of adjusted treatment means is constructed.

On the contrary, if the regression does not prove to be significant, the conventional analysis of variance may be carried out ignoring the values of the concomitant variable X.

The multiple-comparisons test (e.g. the LSD test) should be applied to the adjusted treatment means, that is, means which are adjusted for the concomitant factor and which are obtained by subtracting the quantity $b_{yx}(\bar X_{.j} - \bar X_{..})$ from the unadjusted means $\bar Y_{.j}$. That is, the corrected means are obtained from the equation

$\text{adj } \bar Y_{.j} = \bar Y_{.j} - b_{yx}(\bar X_{.j} - \bar X_{..}), \quad j = 1, 2, \ldots, k.$

To compare two adjusted means such as adj $\bar Y_{.1}$ and adj $\bar Y_{.2}$, we find the estimated variance of a difference between two adjusted means by the formula

$$s_e^2 \left[\frac{2}{r} + \frac{(\bar X_{.1} - \bar X_{.2})^2}{E_{xx}}\right] \quad \text{when } n_1 = n_2 = r.$$
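The adjustment of treatment means and the variance of a difference between two adjusted means can be sketched as follows. All numbers are hypothetical and chosen for easy arithmetic; the function names are assumptions, and the variance formula is the equal-group-size form stated above.

```python
# Sketch: adjusted treatment means and the estimated variance of the
# difference between two adjusted means. All inputs are hypothetical.

def adjusted_means(y_means, x_means, group_sizes, b):
    """adj Ybar_j = Ybar_j - b * (Xbar_j - grand mean of X)."""
    n = sum(group_sizes)
    grand_x = sum(size * m for size, m in zip(group_sizes, x_means)) / n
    return [y - b * (x - grand_x) for y, x in zip(y_means, x_means)]


def variance_of_difference(s_e_sq, r, x_mean_1, x_mean_2, e_xx):
    """s_e^2 * [2/r + (Xbar_1 - Xbar_2)^2 / E_xx], for equal group size r."""
    return s_e_sq * (2.0 / r + (x_mean_1 - x_mean_2) ** 2 / e_xx)


# Three groups of four observations each (assumed values).
adj = adjusted_means(y_means=[10.0, 14.0, 12.0],
                     x_means=[2.0, 4.0, 3.0],
                     group_sizes=[4, 4, 4],
                     b=0.5)
var_12 = variance_of_difference(s_e_sq=2.0, r=4,
                                x_mean_1=2.0, x_mean_2=4.0, e_xx=8.0)

print(adj, var_12)
```

A group whose covariate mean lies above the grand mean has its Y mean pulled down (for positive b), and the variance of a comparison grows with the distance between the two covariate means.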


22.2.1. Alternative Computing Formulas for Sums of Products: We can simplify the computations of the sums of products. For example, the sum of products of the deviations of n pairs of observations $(Y_i, X_i)$, i = 1, 2, …, n from their means $(\bar X, \bar Y)$ is given by

$$\sum (X_i - \bar X)(Y_i - \bar Y) = \sum (X_i Y_i - \bar X Y_i - X_i \bar Y + \bar X \bar Y)$$
$$= \sum X_i Y_i - n\bar X \bar Y - n\bar X \bar Y + n\bar X \bar Y = \sum X_i Y_i - \frac{T T'}{n},$$

where $T = \sum X_i$ and $T' = \sum Y_i$.

Applying this result to the sum of products $\sum\sum (X_{ij} - \bar X_{..})(Y_{ij} - \bar Y_{..})$, we get

$$S_{xy} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{..})(Y_{ij} - \bar Y_{..}) = \sum_j \sum_i X_{ij} Y_{ij} - \frac{T_{..} T'_{..}}{n},$$

where $T_{..} = \sum\sum X_{ij}$, $T'_{..} = \sum\sum Y_{ij}$ and $n = \sum n_j$.

For Within or Error sum of products, on summing first over the observations in the jth group, we have

$$\sum_i (X_{ij} - \bar X_{.j})(Y_{ij} - \bar Y_{.j}) = \sum_i X_{ij} Y_{ij} - \frac{T_{.j} T'_{.j}}{n_j},$$

where $T_{.j} = \sum_i X_{ij} = n_j \bar X_{.j}$ and $T'_{.j} = \sum_i Y_{ij} = n_j \bar Y_{.j}$ are the sums of the observations of X and Y in the jth group.

Next summing over the values of j, we get

$$E_{xy} = \sum_j \sum_i (X_{ij} - \bar X_{.j})(Y_{ij} - \bar Y_{.j}) = \sum_j \sum_i X_{ij} Y_{ij} - \sum_j \frac{T_{.j} T'_{.j}}{n_j}$$

The sum of products for Between Treatments or Groups, i.e., $T_{xy}$, can be obtained by subtraction. The three sums of products are unchanged by a change of the origins of X and Y.

Example 22.1. If A, B and C are the three methods of teaching spelling, X the original spelling performance and Y the later spelling performance of the four students allocated to systems A, B and C as tabulated below, set up the table of analysis of covariance. What conclusions can be drawn from the table of analysis of covariance?
(P.U., D.St. 1964, M.A., 1969)

The necessary computations are shown below:

$\sum\sum X_{ij}^2 = (3)^2 + (2)^2 + \ldots + (1)^2 = 92$
$\sum\sum Y_{ij}^2 = (10)^2 + (8)^2 + \ldots + (7)^2 = 1080$
$\sum\sum X_{ij} Y_{ij} = (3)(10) + (2)(8) + \ldots + (1)(7) = 302$
$\sum T_{.j}^2 = 64 + 225 + 49 = 338$
$\sum T'^{\,2}_{.j} = 1369 + 2209 + 676 = 4254$
$\sum (T_{.j})(T'_{.j}) = (8 \times 37) + (15 \times 47) + (7 \times 26) = 1183$

The sums of squares and products are computed below.

| | X | Y | XY |
| 1. Correction factor | $T_{..}^2/n = (30)^2/12 = 75$ | $T'^{\,2}_{..}/n = (110)^2/12 = 1008.33$ | $T_{..}T'_{..}/n = (30)(110)/12 = 275$ |
| 2. Total SS | $\sum\sum X_{ij}^2 - C.F. = 92 - 75 = 17$ $(=S_{xx})$ | $\sum\sum Y_{ij}^2 - C.F. = 1080 - 1008.33 = 71.67$ $(=S_{yy})$ | $\sum\sum X_{ij}Y_{ij} - C.F. = 302 - 275 = 27$ $(=S_{xy})$ |
| 3. Treatment SS | $\sum T_{.j}^2/r - C.F. = 338/4 - 75 = 9.50$ $(=T_{xx})$ | $\sum T'^{\,2}_{.j}/r - C.F. = 4254/4 - 1008.33 = 55.17$ $(=T_{yy})$ | $\sum T_{.j}T'_{.j}/r - C.F. = 1183/4 - 275 = 20.75$ $(=T_{xy})$ |
| 4. Error SS | By subtraction, 7.50 $(=E_{xx})$ | By subtraction, 16.50 $(=E_{yy})$ | By subtraction, 6.25 $(=E_{xy})$ |

| Source of Variation | d.f. | $\sum x^2$ | $\sum xy$ | $\sum y^2$ | Regression Co-efficient |
| Treatment | 2 | 9.50 | 20.75 | 55.17 | |
| Error | 9 | 7.50 | 6.25 | 16.50 | 6.25/7.50 = 0.83 |
| Total | 11 | 17.00 | 27.00 | 71.67 | |

Now we set up another table in which the sums of squares have been corrected for the effect of regression.

| Source of Variation | Regression SS | d.f. | Adjusted Sum of Squares | Mean Square | Computed F |
| Total | $(27)^2/17 = 42.88$ | 10 | 71.67 − 42.88 = 28.79 | | |
| Error | $(6.25)^2/7.50 = 5.21$ | 8 | 16.50 − 5.21 = 11.29 | 1.41 | |
| Treatment | | 2 | 28.79 − 11.29 = 17.50 (by subtraction) | 8.75 | 6.20 |

We reject our null hypothesis that there is no difference in the effects of the methods of teaching, if $F \ge F_\alpha(2, 8)$. Choosing $\alpha = 0.05$, we see that $F_{0.05}(2, 8) = 4.46$. Hence we reject our null hypothesis and conclude that the later spelling performance, corrected for differences in original spelling performance, varies significantly from method to method.

If we do not wish to consider the original spelling performance, i.e., the X-values, then significance of differences in the means of later spelling performance would be tested by the F-statistic computed as

$$F = \frac{55.17}{2} \div \frac{16.50}{9} = \frac{27.58}{1.83} = 15.07.$$

This large value of F could be ascribed to the differences in the X-values. The regression line (error) for the method A would be

$$\hat Y_A = \bar Y_A + b(X_{iA} - \bar X_A) = \frac{37}{4} + 0.83\,(X_{iA} - 2) = 9.25 + 0.83\,(X_{iA} - 2)$$

In order to test the significance of the regression co-efficient, b, we compute the value of the F-statistic by the formula

$$F = \frac{\text{Regression MS}}{\text{Deviation about Regression MS}}$$

22.3 TWO-WAY ANALYSIS OF COVARIANCE 
(with no interaction). . i 


Let a random sample of n pairs of observations Xij Yy) trom ni 
homogeneous population, be classified according to two Pees or: 
classificiation, say, Treatments and Blocks. Assuming one pair of 
observation per cell, we may arrange the data in a rectangular arrary 
having c-columns and r-rows so that n=rc. Let the columns represent 
Treatments and Rows, Blocks. Then the pair (Xj, Yy) ee A 
ith block and the jth treatment will appear in the ith row and the jth . 


- column. 


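For partitioning the sums of squares for the two variables X and Y, we have the identities

$$(X_{ij} - \bar{X}_{..}) = (\bar{X}_{i.} - \bar{X}_{..}) + (\bar{X}_{.j} - \bar{X}_{..}) + (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..}), \text{ and}$$

$$(Y_{ij} - \bar{Y}_{..}) = (\bar{Y}_{i.} - \bar{Y}_{..}) + (\bar{Y}_{.j} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..}).$$

Squaring both sides and summing over all values, we get

$$\sum_i\sum_j (X_{ij} - \bar{X}_{..})^2 = c\sum_i (\bar{X}_{i.} - \bar{X}_{..})^2 + r\sum_j (\bar{X}_{.j} - \bar{X}_{..})^2 + \sum_i\sum_j (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})^2, \text{ and}$$

$$\sum_i\sum_j (Y_{ij} - \bar{Y}_{..})^2 = c\sum_i (\bar{Y}_{i.} - \bar{Y}_{..})^2 + r\sum_j (\bar{Y}_{.j} - \bar{Y}_{..})^2 + \sum_i\sum_j (Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..})^2,$$

where all the symbols have their usual significance.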

These results may be symbolically written as

$$S_{xx} = B_{xx} + T_{xx} + E_{xx}, \text{ and}$$

$$S_{yy} = B_{yy} + T_{yy} + E_{yy},$$

where B stands for Blocks and T for Treatments, etc.


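The partitioning of the sum of products is obtained by multiplying the two identities for the variables X and Y and by summing over all values. Thus we get

$$\sum_i\sum_j (X_{ij} - \bar{X}_{..})(Y_{ij} - \bar{Y}_{..}) = c\sum_i (\bar{X}_{i.} - \bar{X}_{..})(\bar{Y}_{i.} - \bar{Y}_{..}) + r\sum_j (\bar{X}_{.j} - \bar{X}_{..})(\bar{Y}_{.j} - \bar{Y}_{..}) + \sum_i\sum_j (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})(Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..}),$$

i.e., $S_{xy} = B_{xy} + T_{xy} + E_{xy}$.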


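Source of Variation | Degrees of freedom | Sum of Squares and Products: Σx² | Σxy | Σy² | Regression Co-efficient
Treatments | c − 1 | $T_{xx}$ | $T_{xy}$ | $T_{yy}$ |
Blocks | r − 1 | $B_{xx}$ | $B_{xy}$ | $B_{yy}$ |
Error | (r − 1)(c − 1) | $E_{xx}$ | $E_{xy}$ | $E_{yy}$ | $b = E_{xy}/E_{xx}$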

As before, we test the significance of the regression between X and Y by computing the regression co-efficient, $b = E_{xy}/E_{xx}$, and by comparing the Regression SS, $E_{xy}^2/E_{xx}$, having 1 degree of freedom with the Deviations about Regression SS (i.e., $E_{yy} - E_{xy}^2/E_{xx}$) having (cr − c − r) degrees of freedom by using the F-statistic. If the regression proves to be significant, the sums of squares must be corrected for the effect of regression before proceeding to test the significance of the effects due to the factors of classification. For example, the corrected sum of squares for Treatments is found by calculating a reduced form of the above table as shown below:


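Source of Variation | Regression Adjusted Sum of Squares | d.f.
Treatment + Error | $(T+E)'_{yy} = (T_{yy} + E_{yy}) - \dfrac{(T_{xy} + E_{xy})^2}{T_{xx} + E_{xx}}$ | r(c − 1) − 1
Error | $E'_{yy} = E_{yy} - \dfrac{E_{xy}^2}{E_{xx}}$ | (r − 1)(c − 1) − 1
Treatment | $T'_{yy} = (T+E)'_{yy} - E'_{yy}$ | c − 1

The Error and Treatment Mean Squares are

$$s_e^2 = \frac{E'_{yy}}{(r-1)(c-1)-1} \quad\text{and}\quad s_t^2 = \frac{T'_{yy}}{c-1}.$$

The ratio $F = s_t^2/s_e^2$ has an F-distribution with $v_1 = c - 1$, $v_2 = (r-1)(c-1) - 1$ degrees of freedom. Hence we reject the hypothesis that the treatment effects are zero, if F exceeds $F_{\alpha}(v_1, v_2)$.

In a similar way, we can build up a test for the effect of blocks or the second factor of classification.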


Example 22.2. Perform the analysis of covariance upon the data given below:

[Data table of X and Y values classified by Treatments and Blocks; not legible in the source.]

Compute tests of significance and adjusted treatment means.
(LU,, Mg, 
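For analysis of covariance, the necessary computations are given below:

(i) Total SS and Products:

$$S_{xx} = \sum\sum X_{ij}^2 - \frac{(\sum\sum X_{ij})^2}{n} = (5)^2 + (15)^2 + \dots + (15)^2 - \frac{(90)^2}{9} = 1060 - 900 = 160,$$

$$S_{yy} = \sum\sum Y_{ij}^2 - \frac{(\sum\sum Y_{ij})^2}{n} = (17)^2 + (16)^2 + \dots + (24)^2 - \frac{(180)^2}{9} = 3840 - 3600 = 240, \text{ and}$$

$$S_{xy} = \sum\sum X_{ij}Y_{ij} - \frac{(\sum\sum X_{ij})(\sum\sum Y_{ij})}{n} = (469 + 536 + 726) - \frac{(90)(180)}{9} = 1731 - 1800 = -69.$$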


(ii) Treatment SS and Products (treatment totals: 32, 29, 29 for X and 45, 57, 78 for Y):

$$T_{xx} = \frac{2706}{3} - \frac{(90)^2}{9} = 902 - 900 = 2,$$

$$T_{yy} = \frac{11358}{3} - \frac{(180)^2}{9} = 3786 - 3600 = 186, \text{ and}$$

$$T_{xy} = \frac{(32 \times 45) + (29 \times 57) + (29 \times 78)}{3} - \frac{(90 \times 180)}{9} = \frac{5355}{3} - 1800 = 1785 - 1800 = -15.$$


(iii) Block SS and Products:

$$B_{xx} = \frac{(15)^2 + (33)^2 + (42)^2}{3} - \frac{(90)^2}{9} = \frac{3078}{3} - 900 = 1026 - 900 = 126,$$

$$B_{yy} = \frac{(69)^2 + (57)^2 + (54)^2}{3} - \frac{(180)^2}{9} = \frac{10926}{3} - 3600 = 3642 - 3600 = 42, \text{ and}$$

$$B_{xy} = \frac{(15 \times 69) + (33 \times 57) + (42 \times 54)}{3} - \frac{(90 \times 180)}{9} = \frac{5184}{3} - 1800 = 1728 - 1800 = -72.$$


(iv) Error SS and Products are obtained by subtraction.

$$E_{xx} = S_{xx} - T_{xx} - B_{xx} = 160 - 2 - 126 = 32,$$

$$E_{yy} = S_{yy} - T_{yy} - B_{yy} = 240 - 186 - 42 = 12, \text{ and}$$

$$E_{xy} = S_{xy} - T_{xy} - B_{xy} = -69 - (-15) - (-72) = 18.$$

Adjusted SS for Error:

$$\text{Adjusted } \sum y^2 = E_{yy} - \frac{E_{xy}^2}{E_{xx}} = 12 - \frac{(18)^2}{32} = 12 - 10.125 = 1.875.$$

Adjusted SS for Treatment plus Error:

$$\text{Adjusted } \sum y^2 = (T_{yy} + E_{yy}) - \frac{(T_{xy} + E_{xy})^2}{T_{xx} + E_{xx}} = 198 - \frac{(3)^2}{34} = 198 - 0.265 = 197.735.$$


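Hence the Analysis of Covariance table is

Source of Variation | d.f. | Σx² | Σxy | Σy² | Adjusted Σy² | d.f. | Mean Square | F
Treatments | 2 | 2 | −15 | 186 | | | |
Blocks | 2 | 126 | −72 | 42 | | | |
Error | 4 | 32 | 18 | 12 | 1.875 | 3 | 0.625 |
Treatments + Error | 6 | 34 | 3 | 198 | 197.735 | 5 | |
Treatments adjusted | | | | | 195.860 | 2 | 97.93 | 156.69

The 5-percent F with d.f. $v_1 = 2$ and $v_2 = 3$ is 9.55. The calculated value of F is 156.69, which is highly significant. There is therefore evidence that real differences exist among the treatment means when adjusted for the covariate X.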
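The arithmetic of Example 22.2 can be retraced from the quoted totals alone. The following is a minimal sketch, not the authors' procedure: the variable and function names are illustrative, and the block totals of Y are taken as 69, 57 and 54, the values that reproduce the quoted figures 10926 and 1728.

```python
# Totals quoted in Example 22.2 (3 treatments x 3 blocks, n = 9 pairs).
treat_x, treat_y = [32, 29, 29], [45, 57, 78]   # treatment totals of X and Y
block_x, block_y = [15, 33, 42], [69, 57, 54]   # block totals (Y totals assumed, see lead-in)
raw = {"xx": 1060, "xy": 1731, "yy": 3840}      # uncorrected sums of squares and products
r, c = 3, 3
n = r * c
gx, gy = sum(treat_x), sum(treat_y)             # grand totals: 90 and 180
cf = {"xx": gx * gx / n, "xy": gx * gy / n, "yy": gy * gy / n}  # correction terms


def from_totals(tx, ty, size):
    """Corrected sums of squares and products built from totals of `size` values each."""
    raw_t = {
        "xx": sum(a * a for a in tx) / size,
        "xy": sum(a * b for a, b in zip(tx, ty)) / size,
        "yy": sum(b * b for b in ty) / size,
    }
    return {k: raw_t[k] - cf[k] for k in cf}


def adjust(ssp):
    """Sum of squares of Y after removing the linear regression on X."""
    return ssp["yy"] - ssp["xy"] ** 2 / ssp["xx"]


S = {k: raw[k] - cf[k] for k in cf}              # total
T = from_totals(treat_x, treat_y, r)             # treatments
B = from_totals(block_x, block_y, c)             # blocks
E = {k: S[k] - T[k] - B[k] for k in cf}          # error, by subtraction
TE = {k: T[k] + E[k] for k in cf}                # treatments + error

adj_error = adjust(E)
adj_treat = adjust(TE) - adj_error
df_error = (r - 1) * (c - 1) - 1                 # one d.f. is lost to the regression
F = (adj_treat / (c - 1)) / (adj_error / df_error)
print(E, round(adj_error, 3), round(adj_treat, 2), round(F, 2))
```

The error line comes out as 32, 18 and 12, the adjusted error SS as 1.875, and F as 156.69, in agreement with the figures quoted in the example.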


To find adjusted means, we make the following computations:

For Error, regression co-efficient, $b = \dfrac{E_{xy}}{E_{xx}} = \dfrac{18}{32} = 0.5625$.

The grand mean of X is $\bar{X}_{..} = \dfrac{90}{9} = 10$.

Hence the adjusted means of treatments are computed below:

Treatment | Adjusted means
A | 14.62
B | 19.19
C | 26.19
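As a check on the adjusted means, each treatment mean of Y can be corrected by b times the deviation of its X mean from the grand mean of X. This is a small sketch using the treatment totals quoted in Example 22.2; the variable names are illustrative.

```python
# Treatment totals of X and Y quoted in Example 22.2; r = 3 observations per treatment.
treat_x = [32, 29, 29]
treat_y = [45, 57, 78]
r = 3
b = 18 / 32                                        # error regression co-efficient E_xy / E_xx
grand_mean_x = sum(treat_x) / (r * len(treat_x))   # 90 / 9 = 10

# Adjusted mean = Y mean - b * (X mean - grand mean of X)
adjusted = [ty / r - b * (tx / r - grand_mean_x) for tx, ty in zip(treat_x, treat_y)]
for name, value in zip("ABC", adjusted):
    print(name, round(value, 4))
```

The three values are 14.625, 19.1875 and 26.1875, which the text quotes to two decimals.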
Example 22.3. Use the LSD test of significance to analyse the corrected treatment means of data in Example 22.2.

The least-significant difference (LSD) is given by the relation

$$\text{LSD} = t_{\alpha/2,(\nu)} \sqrt{\frac{2 s_e^2}{r}},$$

where ν = degrees of freedom against adjusted error SS,
$s_e^2$ = Adjusted MSE, and
r = number of observations in each treatment.

Substituting these values, we get

$$\text{LSD} = (3.182)\sqrt{\frac{2(0.625)}{3}} \quad (\because\ t_{0.025,(3)} = 3.182)$$

$$= (3.182)(0.645) = 2.05.$$

To test the significance of the differences between the adjusted treatment means, we arrange them in ascending order of magnitude and draw a line under any pair of adjacent means that do not differ significantly as

A | B | C
14.62 | 19.19 | 26.19

It is observed that all the three adjusted means differ significantly as the difference between any pair exceeds the LSD.
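The LSD comparison can be sketched in a few lines. The tabulated value t = 3.182 and the adjusted MSE of 0.625 are taken from the example; the pairing logic and names are illustrative assumptions.

```python
import math
from itertools import combinations

t_value = 3.182          # t(0.025) with 3 d.f., as quoted in the example
adjusted_mse = 0.625     # adjusted error mean square
r = 3                    # observations per treatment
means = {"A": 14.62, "B": 19.19, "C": 26.19}   # adjusted treatment means

lsd = t_value * math.sqrt(2 * adjusted_mse / r)

# A pair differs significantly when its absolute difference exceeds the LSD.
significant = {
    (a, b): abs(means[a] - means[b]) > lsd
    for a, b in combinations(means, 2)
}
print(round(lsd, 2), significant)
```

Every pair is flagged as significant, which matches the conclusion drawn in the example.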


22.4 ANALYSIS OF COVARIANCE MODELS.
ONE-WAY CLASSIFICATION

Let $Y_{ij}$ denote the ith observation of the jth treatment, $X_{ij}$ the corresponding concomitant variable, on which $Y_{ij}$ has a linear regression with β as the regression co-efficient. Then the fixed effects model for the analysis of covariance is

$$Y_{ij} = \mu + \tau_j + \beta (X_{ij} - \bar{X}_{..}) + e_{ij},$$

where μ is the true mean effect, $\tau_j$ is the effect of the jth treatment, $\bar{X}_{..}$ is the grand mean of the concomitant variable and $e_{ij}$ is the random error assumed to be normally and independently distributed with zero mean and common variance.


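The least-squares estimates are obtained by minimizing $Q = \sum_j\sum_i e_{ij}^2$ subject to the restriction that $\sum_j \tau_j = 0$. Then

$$Q = \sum_j\sum_i \left[Y_{ij} - \mu - \tau_j - \beta (X_{ij} - \bar{X}_{..})\right]^2.$$

Differentiating Q w.r.t. μ and equating to zero, we have

$$\frac{\partial Q}{\partial \mu} = 0 = -2 \sum_j\sum_i \left[Y_{ij} - \mu - \tau_j - \beta (X_{ij} - \bar{X}_{..})\right], \quad\text{or}\quad \hat{\mu} = \bar{Y}_{..}.$$

Again differentiating Q w.r.t. $\tau_j$ and equating to zero, we get

$$\frac{\partial Q}{\partial \tau_j} = 0 = -2 \sum_i \left[Y_{ij} - \mu - \tau_j - \beta (X_{ij} - \bar{X}_{..})\right].$$

Simplifying, we get

$$\hat{\tau}_j = \bar{Y}_{.j} - \bar{Y}_{..} - \hat{\beta}(\bar{X}_{.j} - \bar{X}_{..}),$$

which shows that each treatment mean should be adjusted for the regression on X.

Similarly,

$$\frac{\partial Q}{\partial \beta} = 0 = -2 \sum_j\sum_i (X_{ij} - \bar{X}_{..})\left[Y_{ij} - \mu - \tau_j - \beta (X_{ij} - \bar{X}_{..})\right].$$

Substituting for μ and $\tau_j$ and simplifying, we have

$$\hat{\beta} = \frac{\sum\sum (X_{ij} - \bar{X}_{..})(Y_{ij} - \bar{Y}_{..}) - \sum\sum (\bar{X}_{.j} - \bar{X}_{..})(\bar{Y}_{.j} - \bar{Y}_{..})}{\sum\sum (X_{ij} - \bar{X}_{..})^2 - \sum\sum (\bar{X}_{.j} - \bar{X}_{..})^2} = \frac{S_{xy} - T_{xy}}{S_{xx} - T_{xx}}$$

$$= \frac{\text{Residual or Error cross-product of XY}}{\text{Residual or Error sum of squares of X}} = \frac{E_{xy}}{E_{xx}}.$$

These results can be presented in an analysis of covariance table as before. After adjustment, the Error SS $= E_{yy} - \dfrac{E_{xy}^2}{E_{xx}}$, and

$$\text{Error MS} = \frac{E_{yy} - E_{xy}^2/E_{xx}}{n - k - 1},$$

where n is the total number of observations.

We wish to test the hypothesis that there are no differences among the true effects of the k treatments, i.e., $H_0: \tau_j = 0$.

In other words, our hypothesis is the comparison of

(i) $Y_{ij} = \mu + \tau_j + \beta (X_{ij} - \bar{X}_{..}) + e_{ij}$, and

(ii) $Y_{ij} = \mu + \beta (X_{ij} - \bar{X}_{..}) + e_{ij}$.

We estimate the Treatment SS as the difference between the two Error Sums of Squares for the two models. From (i), we have

$$\text{Error SS} = E_{yy} - \frac{E_{xy}^2}{E_{xx}},$$

and from (ii), we have

$$\text{Error SS} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = (T_{yy} + E_{yy}) - \frac{(T_{xy} + E_{xy})^2}{T_{xx} + E_{xx}},$$

so that

Treatment SS = Error SS (ii) − Error SS (i).

Hence $F = \dfrac{\text{Treatment MS}}{\text{Error MS}}$ has an F-distribution with $v_1 = k - 1$, $v_2 = n - k - 1$ degrees of freedom when our hypothesis is true.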
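The least-squares results above can be sketched for raw data as follows. This is an illustrative implementation under the one-way model, not a definitive one; the function name and the small data set are hypothetical.

```python
def one_way_ancova(groups):
    """groups: one list of (x, y) pairs per treatment. Returns (beta, adjusted means, F)."""
    pairs = [p for g in groups for p in g]
    n, k = len(pairs), len(groups)

    def mean(values):
        return sum(values) / len(values)

    def ssp(data, mx, my):
        """Sums of squares and products (xx, xy, yy) about the means mx, my."""
        xx = sum((x - mx) ** 2 for x, _ in data)
        xy = sum((x - mx) * (y - my) for x, y in data)
        yy = sum((y - my) ** 2 for _, y in data)
        return xx, xy, yy

    gx = mean([x for x, _ in pairs])
    gy = mean([y for _, y in pairs])
    s_xx, s_xy, s_yy = ssp(pairs, gx, gy)          # total line

    e_xx = e_xy = e_yy = 0.0                       # error (within-treatment) line
    group_means = []
    for g in groups:
        mx, my = mean([x for x, _ in g]), mean([y for _, y in g])
        group_means.append((mx, my))
        xx, xy, yy = ssp(g, mx, my)
        e_xx, e_xy, e_yy = e_xx + xx, e_xy + xy, e_yy + yy

    beta = e_xy / e_xx
    error_ss = e_yy - e_xy ** 2 / e_xx             # model (i)
    reduced_ss = s_yy - s_xy ** 2 / s_xx           # model (ii), no treatment effects
    treatment_ss = reduced_ss - error_ss
    f_ratio = (treatment_ss / (k - 1)) / (error_ss / (n - k - 1))
    adjusted = [my - beta * (mx - gx) for mx, my in group_means]
    return beta, adjusted, f_ratio


# Hypothetical data: two treatments, three (x, y) pairs each.
beta, adjusted, f_ratio = one_way_ancova(
    [[(1, 2), (2, 3), (3, 7)], [(2, 5), (3, 9), (4, 10)]]
)
print(beta, adjusted, round(f_ratio, 4))
```

For this toy data the error line is (4, 10, 28), so the slope is 2.5 and the unadjusted means 4 and 8 become 5.25 and 6.75 after adjustment.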

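If F proves to be significant, we can apply the multiple comparisons tests to look for significant comparisons. Suppose there are two treatments, then the difference between them is given by

$$\hat{\tau}_1 - \hat{\tau}_2 = \bar{Y}_{.1} - \bar{Y}_{.2} - \hat{\beta}(\bar{X}_{.1} - \bar{X}_{.2}), \text{ and}$$

$$\text{Var}(\hat{\tau}_1 - \hat{\tau}_2) = \sigma^2 \left[\frac{1}{n_1} + \frac{1}{n_2} + \frac{(\bar{X}_{.1} - \bar{X}_{.2})^2}{E_{xx}}\right],$$

where $n_1$ and $n_2$ are the numbers of observations on the two treatments. Hence all pairs will give different variances.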
In case of two-way analysis of covariance, the fixed effects model is

$$Y_{ij} = \mu + B_i + \tau_j + \beta (X_{ij} - \bar{X}_{..}) + e_{ij}, \quad i = 1, 2, \dots, r;\ j = 1, 2, \dots, c,$$

where the letters have their usual significance. The least-squares estimates are found by minimizing $\sum_i\sum_j e_{ij}^2$ subject to the restrictions that $\sum_i B_i = 0 = \sum_j \tau_j$.

The model may be written as

$$Y_{ij} - \beta (X_{ij} - \bar{X}_{..}) = \mu + B_i + \tau_j + e_{ij}$$

or $Z_{ij} = \mu + B_i + \tau_j + e_{ij}$, where $Z_{ij} = Y_{ij} - \beta (X_{ij} - \bar{X}_{..})$.

The terms on the right hand side of this equation give the standard analysis of variance, while the terms on the left hand side are the deviations of $Y_{ij}$ from its linear regression on $X_{ij}$. Thus the analysis of covariance is a mixture of the regression analysis and the analysis of variance.


22.4.1. Assumptions Made in Analysis of Covariance. The assumptions made in the analysis of covariance are similar to those made in the analysis of variance and regression analysis, as the analysis of covariance is a mixture of these two techniques. However, the assumptions necessary for carrying out the covariance analysis are given as follows:

(i) The populations are normally distributed with homogeneous variances.

(ii) The samples are random and independent.

(iii) The regression is linear and the slope is not zero.

(iv) The treatment, block and regression effects are additive.

(v) The concomitant variable is fixed, implying that it is not being affected by the treatments.

(vi) The residuals $e_{ij}$ are normally and independently distributed with zero mean and a common variance.


22.4.2. Uses of Covariance Analysis. The main uses of the technique of covariance analysis are briefly enumerated below:

(i) It increases the precision of the experiment by removing certain environmental effects that cannot be controlled by the experimental design.

(ii) The covariance adjustments remove the bias due to regression. The nature of the bias is the term $\beta(\bar{X}_{.j} - \bar{X}_{..})$.

(iii) As it partitions the total covariance or the sum of cross-products into component parts, it is therefore used for testing a regression co-efficient or the homogeneity of k linear regression co-efficients.

(iv) It assists in analysing the treatment effects properly when a …

(v) It can be profitably used to estimate missing observations by setting Y = 0 for each missing value and introducing a dummy covariate X in such a way that X = 1 corresponds to the missing values and X = 0 to all others.


EXERCISES

22.1 Define a covariance analysis. Discuss the appropriate model and the assumptions involved. (P.U., B.A. Hons, Part III, 1969)

22.2 Describe the purpose of an analysis of covariance and outline the main stages in its calculation.

22.3 … analysis appropriate? Write down … means are adjusted in covariance analysis?

22.4 … for a two-way classification … blocks, … indicate tests that … (P.U., M.Sc. 1972)
Give the analysis of variance for a two-way classification discussing the appropriate model and the assumptions involved.

If a measurement of a concomitant variable is available for each observation, explain how the analysis would be modified, discussing any further assumptions.


(a) What are the assumptions behind a covariance analysis? 


(b) In the process of analyzing data by a covariance analysis, 
what tests of significance are made? 


(c) Explain the interpretation or inferences and the course of action indicated when each of the above tests is significant; when each is nonsignificant.


22.8 Carry out an analysis of covariance with the data given below:

22.9 An experiment on gain in weight of rats resulted as shown in the table. X indicates the quantity of food and Y, the gain in weight.

Did the four rations A, B, C, D produce different gains in weight of the rats? Are the gains affected materially by quantity of food?

… Test the significance of the regression and compare the mean values (if necessary, adjusted) of X for each …




22.11 


22.12 


Complete the analysis, making appropriate tests to indicate the reason for your conclusions.

(P.U., B.Sc. Hons. Part I, 1972; M.Sc. 1970)

Each of 4 blocks was divided into 3 plots, and 3 different treatments A, B, C were distributed at random among the plots of each block. The rows correspond to blocks and the yields of grain and straw are denoted by X and Y respectively. Is there a covariation between the yields of grain and straw? Test whether the yield of straw after correction for yield of grain varies significantly with treatment.

22.13


Treatments


(P.U., M.Sc., 19
… groups of 5 students each were given an initial test … and the scores obtained are given as X. In a subsequent test, the scores obtained were given as Y. Use the analysis of covariance to test the significance of the differences between subsequent scores.

22.14 What are the basic principles of analysis of covariance? Set up the analysis of covariance table, indicating the nature of questions and manner in which those questions can be answered for the data on two variables X and Y presented in the table:

Treatments

(P.U., M.A., 196…)

(a) It is desired to compare the IQ's of students in three schools. It is known that IQ is related to the students' grade-point average. If samples of 12 students are taken from each school, should their IQ values be adjusted for their grade-point averages by an analysis of covariance, or should an analysis of variance of their IQ's be performed directly?
(P.U., M.Sc. 1972)


22.15 For an experiment, the results are summarised by the following sums of squares and products:


7472.6 


116020.3 3598.05 


682.20 


(a) Based on the error sum of squares and products, is the regression of Y on X significant at α = 0.05?

(b) Are the differences among the treatment means for Y, adjusted for variation attributed to X, significant at α = 0.05?

(c) What conclusions do you draw from the above data about the effects of treatments? Make any additional computations that you consider necessary.


EXPERIMENTAL DESIGNS


23.1 INTRODUCTION


By an experimental design, we mean a plan used to collect the data relevant to the problem under study in such a way as to provide a basis for valid and objective inference about the stated problem. The plan usually consists of the selection of treatments whose effects are to be studied, the specification of the experimental layouts, the assignment of treatments to the experimental units and the collection of observations for analysis. All these steps are accomplished before any experiment is performed.

An experiment is planned to

(i) get maximum information for minimum expenditure in the minimum possible time;

(ii) avoid systematic errors;

(iii) evaluate the outcomes critically and logically;

(iv) ignore spurious effects, if any.


The following considerations go into the planning of an experiment:

(i) What is the experiment intended to do?

(ii) What is the nature of the treatments or dependent variables and how are they to be estimated?

(iii) How is the independent variable likely to affect the treatments or dependent variables?

(iv) Are the factors to be held constant or varied? If varied, whether the variation is quantitative or qualitative?

The answers to these questions enable the experimenter to state his hypothesis more precisely and to plan his experimental procedure in an effective way.

There are two types of designs: systematic and randomized designs, but the analysis of variance techniques are suitable to randomized designs only. The basic randomized designs are (i) Completely Randomized, (ii) Randomized Complete Blocks, and (iii) Latin Squares, which we discuss in the sections that follow.


23.2 BASIC PRINCIPLES OF EXPERIMENTAL DESIGNS


The basic principles of experimental designs are randomization, replication and local control. These principles make a valid test of significance possible. Each of them is described briefly in the following subsections.

23.2.1. Randomization. The first principle of an experimental design is randomization, which is a random process of assigning treatments to the experimental units. The random process implies that every possible allotment of treatments has the same probability. An experimental unit is the smallest division of the experimental material and a treatment means an experimental condition whose effect is to be measured and compared. The purpose of randomization is to remove bias and other sources of extraneous variation, which are not controllable. Another advantage of randomization (accompanied by replication) is that it forms the basis of any valid statistical test. Hence the treatments must be assigned at random to the experimental units. Randomization is usually done by drawing numbered cards from a well-shuffled pack of cards, or by drawing numbered balls from a well-shaken container or by using tables of random numbers.


23.2.2. Replication. The second principle of an experimental design is replication, which is a repetition of the basic experiment. In other words, it is a complete run for all the treatments to be tested in the experiment. In all experiments, some variation is introduced because of the fact that the experimental units such as individuals or plots of land in agricultural experiments, cannot be physically identical. This type of variation can be removed by using a number of experimental units. We therefore perform the experiment more than once, i.e., we repeat the basic experiment. An individual repetition is called a replicate. The number, the shape and the size of replicates depend upon the nature of the experimental material. A replication is used

(i) to secure a more accurate estimate of the experimental error, a term which represents the differences that would be observed if the same treatments were applied several times to the same experimental units;

(ii) to decrease the experimental error and thereby to increase precision, which is a measure of the variability of the experimental error; and

(iii) to obtain a more precise estimate of the mean effect of a treatment, since $\sigma_{\bar{y}}^2 = \dfrac{\sigma^2}{n}$, where n denotes the number of replications.


23.2.3. Local Control. It has been observed that all extraneous sources of variation are not removed by randomization and replication. This necessitates a refinement in the experimental technique. In other words, we need to choose a design in such a manner that all extraneous sources of variation are brought under control. For this purpose, we make use of local control, a term referring to the amount of balancing, blocking and grouping of the experimental units. Balancing means that the treatments should be assigned to the experimental units in such a way that the result is a balanced arrangement of the treatments. Blocking means that like experimental units should be collected together to form a relatively homogeneous group. A block is also a replicate. The main purpose of the principle of local control is to increase the efficiency of an experimental design by decreasing the experimental error. The point to remember here is that the term local control should not be confused with the word control. The word control in experimental design is used for a treatment which does not receive any treatment when we need to find out the effectiveness of other treatments through comparison.


23.3 THE COMPLETELY RANDOMIZED DESIGN


A completely randomized (CR) design, which is the simplest type of the basic designs, may be defined as a design in which the treatments are assigned to experimental units completely at random, that is the randomization is done without any restrictions. The design is completely flexible, i.e., any number of treatments and any number of units per treatment may be used. Moreover, the number of units per treatment need not be equal. A completely randomized design is considered to be most useful in situations where (i) the experimental units are homogeneous, (ii) the experiments are small such as laboratory experiments, and (iii) some experimental units are likely to be destroyed or to fail to respond.

23.3.1. Experimental Layout. The layout of an experiment is the placement of the treatments on the experimental units, which may pertain to time, space or type of material. Suppose we have k treatments and the experimental material is divided into n experimental units. We shall then assign the k treatments at random to the experimental units in such a way that the treatment $\tau_j$ (j = 1, 2, ..., k) is applied $r_j$ times, with $\sum r_j = n$. When each treatment is applied (or replicated) an equal number of times, then $r_1 = r_2 = \dots = r_k = r$ and $\sum r_j = rk$.

An example of the experimental layout for a completely randomized design (CR) using four treatments A, B, C and D, each repeated 3 times is given below:

[Layout of the 12 experimental units not legible in the source.]

The result or response of a treatment which may be a real yield, the gain in weight, the ability, etc., is generally called yield and is represented by the letter Y.
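A layout of this kind can be generated by shuffling the list of treatment labels, which plays the role of a well-shuffled pack of cards. The sketch below is illustrative only; the function name, the seed and the 3 x 4 printing arrangement are assumptions.

```python
import random


def cr_layout(treatments, reps, seed=None):
    """Assign each treatment to `reps` experimental units completely at random."""
    rng = random.Random(seed)                     # seed only makes the layout repeatable
    units = [t for t in treatments for _ in range(reps)]
    rng.shuffle(units)                            # unrestricted randomization
    return units


# Four treatments A, B, C, D, each repeated 3 times, on 12 units.
layout = cr_layout("ABCD", 3, seed=1)
for row in range(3):
    print(" ".join(layout[4 * row: 4 * row + 4]))
```

Because the shuffle is unrestricted, a treatment may appear several times in one row of the printed layout; that is the defining feature of the completely randomized design.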


23.3.2. Statistical Model and Analysis. Let $Y_{ij}$ denote the yield of the ith observation on treatment j. Then $Y_{ij}$ may be represented by the linear model

$$Y_{ij} = \mu + \tau_j + e_{ij}, \quad i = 1, 2, \dots, r_j;\ j = 1, 2, \dots, k,$$

where μ represents the true mean effect, $\tau_j$ represents the effect of treatment j and $e_{ij}$ denotes the random error, normally and independently distributed with mean zero and variance σ². The null hypothesis in the case of fixed effects model may then be stated as

$H_0: \tau_j = 0$ for all j = 1, 2, ..., k,

and the alternative hypothesis as $H_1$: some $\tau_j \ne 0$. These hypotheses are equivalent to the following set of hypotheses

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$, and
$H_1$: Not all $\mu_j$ are equal.

For analysis of variance, the least-squares estimates are obtained by minimizing $\sum_j\sum_i e_{ij}^2$, subject to the restriction that $\sum \tau_j = 0$. (Remember the $\tau_j$ are the deviations from μ.)

Statistical Analysis. For convenience, the data of yields obtained from a CR design can be assembled as follows for statistical analysis:


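Observation | Treatment 1 | Treatment 2 | ... | Treatment k
1 | $Y_{11}$ | $Y_{12}$ | ... | $Y_{1k}$
2 | $Y_{21}$ | $Y_{22}$ | ... | $Y_{2k}$
... | ... | ... | ... | ...
Total | $T_{.1}$ | $T_{.2}$ | ... | $T_{.k}$
Means | $\bar{Y}_{.1}$ | $\bar{Y}_{.2}$ | ... | $\bar{Y}_{.k}$

This is equivalent to k independent random samples and hence analogous to one-way classification. Thus the partitioning of the total sum of squares into components for treatments and error would be as usual, i.e.,

$$\sum_{j=1}^{k}\sum_{i=1}^{r_j} (Y_{ij} - \bar{Y}_{..})^2 = \sum_{j=1}^{k} r_j (\bar{Y}_{.j} - \bar{Y}_{..})^2 + \sum_{j=1}^{k}\sum_{i=1}^{r_j} (Y_{ij} - \bar{Y}_{.j})^2$$

or Total SS = Treatment SS + Error SS

and the ANOVA-Table for a CR design would be

Source of Variation | d.f. | Sum of Squares | Mean Square | F
Treatments ($\tau_j$) | k − 1 | $\sum r_j (\bar{Y}_{.j} - \bar{Y}_{..})^2$ | $s_t^2$ | $s_t^2 / s_e^2$
Error ($e_{ij}$) | n − k | By subtraction | $s_e^2$ |
Total | n − 1 | $\sum\sum (Y_{ij} - \bar{Y}_{..})^2$ | |

The computations of the sums of squares are carried out as usual.

Now, if our hypothesis that all treatment effects are zero is true and if the assumptions of normality, independence and homogeneous variances underlying the analysis of variance hold, then the ratio

$$F = \frac{s_t^2}{s_e^2} = \frac{\text{MS for Treatments}}{\text{MS for Error}}$$

has an F-distribution with $v_1 = k - 1$, $v_2 = n - k$ degrees of freedom.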
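The partition for a completely randomized design can be sketched directly from the treatment totals, allowing unequal numbers of observations per treatment. The function name and the small data set are hypothetical and serve only to illustrate the computation.

```python
def one_way_anova(groups):
    """ANOVA for a CR design; `groups` holds one list of yields per treatment."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / n                       # T..^2 / n

    total_ss = sum(y * y for g in groups for y in g) - correction
    treatment_ss = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    error_ss = total_ss - treatment_ss                      # by subtraction

    ms_treatment = treatment_ss / (k - 1)
    ms_error = error_ss / (n - k)
    return {
        "treatment_ss": treatment_ss,
        "error_ss": error_ss,
        "df": (k - 1, n - k),
        "F": ms_treatment / ms_error,
    }


# Hypothetical yields for three treatments.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(result)
```

For these figures the treatment SS is 42 on 2 d.f. and the error SS is 6 on 6 d.f., so the F-ratio is 21; it would then be compared with the tabulated $F_{\alpha}(2, 6)$.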


The hypothesis is rejected at the α level of significance if $F \ge F_{\alpha}(v_1, v_2)$.

If the hypothesis is rejected and the difference between treatment effects proves to be significant, then the difference between two treatment means can be tested by

$$t = \frac{\bar{Y}_{.j} - \bar{Y}_{.j'}}{s_e \sqrt{\dfrac{1}{r_j} + \dfrac{1}{r_{j'}}}},$$

where the statistic t has Student's t-distribution with (n − k) degrees of freedom.

To answer questions of the type which treatment is the best, which treatment is the second best, we use multiple comparisons tests.


23.3.3. Advantages and Disadvantages. The advantages and disadvantages of the completely randomized designs are given below:

(i) The design is very simple and is easily laid out.

(ii) It has the simplest statistical analysis.

(iii) It provides the maximum number of degrees of freedom for error sum of squares.

(iv) The design is flexible, i.e., any number of treatments and of replications may be used.

(v) The design is applicable only to a small number of treatments.

(vi) There is a possibility of the whole of the variation among the experimental units entering into the experimental error as randomization is not restricted in any direction.


This experiment may be described by the following statistical model:

$$Y_{ij} = \mu + \tau_j + e_{ij},$$

where $Y_{ij}$ represents the yield of the ith observation on treatment (variety) j, and $\tau_j$ denotes the effect of the treatment j, etc. Here j = 1, 2 and 3 and i = 1, 2, 3, 4 for each treatment. To analyse these results, we proceed as below:

(i) We formulate our null hypothesis as

$H_0: \tau_j = 0$ for all j = 1, 2 and 3, i.e., there is no difference among the yielding capabilities of the three varieties of potatoes;

and the alternative hypothesis as

$H_1$: Not all $\tau_j$ are equal.

(ii) We choose the significance level at α = 0.05.

(iii) The test-statistic to be used is

$$F = \frac{\text{MS for Treatments}}{\text{MS for Error}}$$

which, if $H_0$ is true, has an F-distribution with $v_1 = k - 1$ and $v_2 = n - k$ degrees of freedom.

(iv) The necessary computations are carried out as below:

[Computation table only partly legible in the source. Legible entries, with squares in parentheses: 18 (324), 28 (784), 20 (400), 17 (289), 17 (289), 21 (441); one treatment total 67, with square 4489; sum of squared treatment totals 18941.]
$$\text{Treatment SS} = \sum_j \frac{T_{.j}^2}{r} - \frac{T_{..}^2}{n} = \frac{18941}{4} - 4680.75 = 54.50.$$

The sum of squares for experimental error is obtained by subtraction. The analysis of variance table is

Source of Variation | d.f. | Sum of Squares | Mean Square | Computed F
Treatments ($\tau_j$) | 2 | 54.50 | 27.25 | 1.13
Error ($e_{ij}$) | 9 | 217.75 | 24.19 |

(v) The critical region is $F \ge F_{0.05}(2, 9) = 4.26$.

(vi) Conclusion. Since the computed value of F = 1.13 does not fall in the critical region, so we accept our null hypothesis and may conclude that there is no difference among the yielding capabilities of the three varieties of potatoes.


23.4 THE RANDOMIZED COMPLETE BLOCK DESIGN


A randomized complete block (RCB) design is a design in which (i) the experimental material is divided into blocks, each of which is relatively homogeneous, (ii) each block contains a complete set of treatments, i.e., it constitutes a replication of treatments, and (iii) the treatments are assigned at random to the experimental units within each block, which means the randomization is restricted.

23.4.1. Experimental Layout. The layout of a randomized complete block design, using three blocks, might be as follows:

BLOCK I
BLOCK II
BLOCK III

[Arrangement of treatments within the blocks not legible in the source.]

In actual field or laboratory, the treatments are applied to the experimental units in positions corresponding to the plots or units shown in the layout, that is the treatments refer to the actual locations in the field or laboratory.

23.4.2. Statistical Model and Analysis. As each observation in an RCB design is classified by the block to which it belongs and the treatment it receives, therefore $Y_{ij}$ represents the observation corresponding to block i (i = 1, 2, ..., r) and treatment j (j = 1, 2, ..., k). The linear statistical model for this design would be

$$Y_{ij} = \mu + \beta_i + \tau_j + e_{ij},$$

where $\beta_i$ represents block effect and $e_{ij}$ are assumed to be normally and independently distributed with mean zero and variance σ², and where blocks and treatments are orthogonal. The least-squares estimates of the parameters μ, $\beta_i$ and $\tau_j$ are obtained by minimizing the quantity $Q = \sum_i\sum_j e_{ij}^2$ subject to the restriction that $\sum_i \beta_i = 0$ and $\sum_j \tau_j = 0$.

The analysis of a randomized complete block experiment consists of a two-way analysis of variance test where the hypotheses (i) $H_0: \tau_j = 0$ for all j, i.e., all treatment effects are zero, and (ii) $H_0': \beta_i = 0$ for all i, i.e., all block effects are zero, are tested. To facilitate computations, the data may be arranged in a Table as shown below:

Blocks | Treatment 1 | Treatment 2 | ... | Treatment k | Totals | Means
1 | $Y_{11}$ | $Y_{12}$ | ... | $Y_{1k}$ | $T_{1.}$ | $\bar{Y}_{1.}$
2 | $Y_{21}$ | $Y_{22}$ | ... | $Y_{2k}$ | $T_{2.}$ | $\bar{Y}_{2.}$
... | ... | ... | ... | ... | ... | ...
r | $Y_{r1}$ | $Y_{r2}$ | ... | $Y_{rk}$ | $T_{r.}$ | $\bar{Y}_{r.}$
Totals | $T_{.1}$ | $T_{.2}$ | ... | $T_{.k}$ | $T_{..}$ |
Means | $\bar{Y}_{.1}$ | $\bar{Y}_{.2}$ | ... | $\bar{Y}_{.k}$ | | $\bar{Y}_{..}$

The unbiased estimates of the common population variance σ² are obtained by partitioning the total sum of squares as in the case of a two-way table. Accordingly, the partitioning of the total sum of squares into components for treatments, blocks and error would be


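$$\sum_{i=1}^{r}\sum_{j=1}^{k} (Y_{ij} - \bar{Y}_{..})^2 = r\sum_{j=1}^{k} (\bar{Y}_{.j} - \bar{Y}_{..})^2 + k\sum_{i=1}^{r} (\bar{Y}_{i.} - \bar{Y}_{..})^2 + \sum_{i=1}^{r}\sum_{j=1}^{k} (Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..})^2,$$

i.e., Total SS = Treatment SS + Block SS + Error SS,

and the ANOVA-Table for a RCB design would be

Source of Variation | d.f. | Sum of Squares | Mean Square | F
Treatments ($\tau_j$) | k − 1 | $r\sum_j (\bar{Y}_{.j} - \bar{Y}_{..})^2$ = SSTr | $s_t^2$ | $F_1 = s_t^2/s_e^2$
Blocks ($\beta_i$) | r − 1 | $k\sum_i (\bar{Y}_{i.} - \bar{Y}_{..})^2$ = SSB | $s_b^2$ | $F_2 = s_b^2/s_e^2$
Error ($e_{ij}$) | (k − 1)(r − 1) | By subtraction = SSE | $s_e^2$ |
Total | rk − 1 | $\sum\sum (Y_{ij} - \bar{Y}_{..})^2$ = SST | |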


The computations of the sums of squares are carried out as usual.

Now, if our hypothesis $H_0: \tau_j = 0$ for all j, i.e., all treatment effects are zero, is true, then the statistic

$$F_1 = \frac{s_t^2}{s_e^2} = \frac{\text{MS for treatments}}{\text{MS for errors}}$$

has an F-distribution with $v_1 = k - 1$ and $v_2 = (k - 1)(r - 1)$ degrees of freedom. The hypothesis $H_0: \tau_j = 0$ for all j, is rejected at the α level of significance, when $F_1 \ge F_{\alpha}(v_1, v_2)$.

Furthermore if our hypothesis $H_0': \beta_i = 0$ for all i, i.e., all block effects are zero, is true, the statistic

$$F_2 = \frac{s_b^2}{s_e^2} = \frac{\text{MS for blocks}}{\text{MS for errors}}$$

has an F-distribution with $v_1 = r - 1$ and $v_2 = (k - 1)(r - 1)$ degrees of freedom. The hypothesis $H_0': \beta_i = 0$ for all i is rejected at the α level of significance if $F_2$ exceeds $F_{\alpha}(v_1, v_2)$. It is important to point out that if $F_2$ proves to be insignificant, which means there are no differences among blocks, we have a bad design. Usually, the blocks are chosen to be different by making the differences between the blocks as large as possible, so that there is seldom a need to test the hypothesis that there are no differences among the blocks.

The difference between two treatment means selected at random can be tested by computing the least-significant-difference (LSD) as

$$\text{LSD} = t_{\alpha/2} \sqrt{\frac{2(\text{error mean square})}{r}},$$

where r is the number of blocks (replications) and the value of $t_{\alpha/2}$ is obtained from the table of the Student's t-distribution for error degrees of freedom.
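The two-way partition for a randomized complete block design can be sketched as below, with rows taken as blocks and columns as treatments. The function name and the small table are hypothetical; the tabulated t-value for the LSD is left to the user to supply.

```python
import math


def rcb_anova(table):
    """ANOVA for an RCB design; table[i][j] is the yield in block i under treatment j."""
    r, k = len(table), len(table[0])
    n = r * k
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / n

    total_ss = sum(y * y for row in table for y in row) - correction
    block_ss = sum(sum(row) ** 2 for row in table) / k - correction
    treatment_ss = sum(sum(col) ** 2 for col in zip(*table)) / r - correction
    error_ss = total_ss - block_ss - treatment_ss          # by subtraction

    ms_error = error_ss / ((k - 1) * (r - 1))
    return {
        "treatment_ss": treatment_ss,
        "block_ss": block_ss,
        "error_ss": error_ss,
        "ms_error": ms_error,
        "F1": (treatment_ss / (k - 1)) / ms_error,         # treatments
        "F2": (block_ss / (r - 1)) / ms_error,             # blocks
    }


def lsd(t_value, ms_error, r):
    """Least significant difference for two treatment means, each based on r blocks."""
    return t_value * math.sqrt(2 * ms_error / r)


# Hypothetical yields: 3 blocks (rows) by 3 treatments (columns).
result = rcb_anova([[1, 2, 3], [2, 3, 4], [3, 4, 8]])
print(result)
```

For this table the treatment and block sums of squares are both 14 and the error SS is 4 on 4 d.f., so both F-ratios equal 7.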
23.4.3. Advantages and Disadvantages. The important advantages of the randomized complete block design are as follows:

(i) The source of extraneous variation is controlled by grouping the experimental material and hence the estimate of the experimental error is decreased.

(ii) The design is flexible, i.e., any number (but not less than 2) of replications may be run and any number of treatments may be tested.

(iii) The experiment can be set up easily.

(iv) The statistical analysis is simple and straightforward.

(v) It is easy to adjust for the missing observations.

However, the design suffers from the following two disadvantages:

(i) It controls variability only in one direction.

(ii) It is not a suitable design when the number of treatments is very large or when the blocks are not homogeneous.

Example 23.2. Four varieties of wheat were tried in a randomized complete block design in four replications. Yield in kilogram per plot is shown in the table given below. Test the hypothesis that there is no difference in the means of four varieties.

(P.U., B.A./B.Sc. 1971)

(i) We state our null hypothesis as

$H_0: \tau_j = 0$ for all j = 1, 2, 3, 4, i.e., there is no difference in the means of the four varieties of wheat; and the alternative hypothesis is

$H_1$: Not all the four means are equal.

(ii) We choose the significance level at α = 0.05.

(iii) The test-statistic to be used is

$$F_1 = \frac{s_t^2}{s_e^2} = \frac{\text{MS for treatments}}{\text{MS for errors}}$$

which, if $H_0$ is true, has an F-distribution with $v_1 = k - 1$ and $v_2 = (k - 1)(r - 1)$ degrees of freedom.


(iv) The necessary computations are shown below: 


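(Squares of the observations are shown in parentheses.)

Replicates               Varieties                   Total
              V1        V2        V3        V4
I             2 (4)     5 (25)    4 (16)    1 (1)      12
II            2 (4)     3 (9)     3 (9)     1 (1)       9
III           4 (16)    6 (36)    6 (36)    2 (4)      18
IV            1 (1)     4 (16)    2 (4)     3 (9)      10
Total         9         18        15        7          49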


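$$\text{Total SS} = \sum_i\sum_j Y_{ij}^2 - \frac{T_{..}^2}{n}, \quad \text{where } n = rk$$

$$= 191 - \frac{(49)^2}{16} = 191 - 150.06 = 40.94$$

$$\text{Treatment SS} = \sum_j \frac{T_{.j}^2}{r} - \frac{T_{..}^2}{n} = \frac{679}{4} - \frac{(49)^2}{16} = 169.75 - 150.06 = 19.69$$

$$\text{Replicate SS} = \sum_i \frac{B_{i.}^2}{k} - \frac{T_{..}^2}{n} = \frac{649}{4} - \frac{(49)^2}{16} = 162.25 - 150.06 = 12.19$$

The sum of squares for experimental error is obtained by subtraction.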


These results are given in the following ANOVA-Table.

Source of        d.f.    Sum of      Mean       Computed
Variation                Squares     Square     F
Treatments       3       19.69       6.56       $F_1 = 6.50$
Replicates       3       12.19       4.06       $F_2 = 4.02$
Error            9        9.06       1.01
Total            15      40.94

(v) The regions of rejection are $F_1, F_2 \ge F_{0.05}(3, 9) = 3.86$.


(vi) Conclusion. Since the computed value of $F_1$ falls in the critical
region, we therefore reject our null hypothesis and conclude that
the means of the four varieties of wheat are significantly
different.

Furthermore, since the F-statistic indicates rejection of the null
hypothesis, we therefore apply the LSD test to find out which means
differ significantly.

$$\text{Now } \mathrm{LSD} = t_{0.025,(9)} \times \sqrt{\frac{2 s_e^2}{r}} = 2.26\sqrt{\frac{2(1.01)}{4}} = 2.26 \times 0.71 = 1.60.$$


Arranging the means of the four varieties of wheat in ascending
order of magnitude and drawing a line under any subset of adjacent
means that do not differ significantly, we get

V4      V1      V3      V2
1.75    2.25    3.75    4.50
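
The LSD comparison above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the tabulated value t = 2.26 is taken as given from the example rather than computed, and the error mean square 1.01 is the value 9.06/9 implied by the ANOVA figures.

```python
import math


def least_significant_difference(t_value, error_mean_square, replications):
    """LSD = t * sqrt(2 * MSE / r), as used for the RCB design."""
    return t_value * math.sqrt(2.0 * error_mean_square / replications)


def significant_pairs(means, lsd):
    """Return pairs of treatments whose means differ by more than the LSD."""
    names = sorted(means, key=means.get)  # ascending order of the means
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if means[b] - means[a] > lsd]


# Figures quoted in Example 23.2 (t for 9 error d.f. at alpha/2 = 0.025).
lsd = least_significant_difference(t_value=2.26, error_mean_square=1.01,
                                   replications=4)
means = {"V4": 1.75, "V1": 2.25, "V3": 3.75, "V2": 4.50}
pairs = significant_pairs(means, lsd)
print(round(lsd, 2), pairs)
```

Pairs that are not returned correspond to adjacent means that would share an underline in the arrangement above.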


Example 23.3. Following is the plan of a field layout testing
four varieties A, B, C and D of wheat in each of 5 blocks. The plot yields
in pounds are also indicated therein.
BLOCK | BLOCK II BLOCK III BLOCK IV 


BLOCK V 


Perform an analysis of variance and state your conclusions.

(P.U., M.A. 1960)

For statistical analysis, we organise the data as shown in the table
below:


Treatments (Varieties of wheat) 


I 


II 


32.3 33.3 30.8 29.3 
34.0 32.0 ` 34.3 26.0 
35.3 29.8 
32.3 28.0 
35.8 28.8 


28.38 


33.70 


Computations for ANOVA:

$$\text{Total SS} = \sum_i\sum_j Y_{ij}^2 - \frac{T_{..}^2}{n}$$

$$= (32.3)^2 + (34.0)^2 + \dots + (28.8)^2 - \frac{(656.4)^2}{20}$$

$$= 21725.22 - 21543.05 = 182.17$$

$$\text{Treatment SS} = \sum_j \frac{T_{.j}^2}{r} - \frac{T_{..}^2}{n} = \frac{(172.1)^2 + \dots + (141.9)^2}{5} - \frac{(656.4)^2}{20}$$

$$= 21677.50 - 21543.05 = 134.45$$

$$\text{Block SS} = \sum_i \frac{B_{i.}^2}{k} - \frac{T_{..}^2}{n} = \frac{(125.7)^2 + \dots + (135.6)^2}{4} - \frac{(656.4)^2}{20}$$

$$= 21564.51 - 21543.05 = 21.46$$

Error SS = Total SS - (Treatment SS + Block SS)

= 182.17 - (134.45 + 21.46) = 26.26


Hence the ANOVA-Table is

Source of                 d.f.    Sum of      Mean       Computed
Variation                         Squares     Square     F
Treatments ($\tau_j$)     3       134.45      44.82      $F_1 = 20.47$
Blocks ($\beta_i$)        4        21.46       5.36
Error ($e_{ij}$)          12       26.26       2.19
Total                     19      182.17

To test $H_0 : \tau_1 = \tau_2 = \tau_3 = \tau_4$ against $H_1$ : Not all $\tau_j$ are equal, we find
that $F_1 = 20.47$, which exceeds $F_{0.05}(3, 12) = 3.49$. We therefore reject
the null hypothesis and conclude that there is a considerable difference
among the means of the four varieties of wheat.
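
The computations for a randomized complete block design follow the same pattern in every example, so they can be sketched as a small Python function. This is a minimal sketch, not the authors' procedure verbatim: the function name `rcb_anova` and the small data set are hypothetical and serve only to show the arithmetic (correction term, sums of squares, and the F-ratio for treatments).

```python
def rcb_anova(table):
    """ANOVA for an RCB design; table[i][j] = yield of treatment j in block i."""
    r = len(table)          # number of blocks (replications)
    k = len(table[0])       # number of treatments
    n = r * k
    grand = sum(sum(row) for row in table)
    correction = grand ** 2 / n

    total_ss = sum(y * y for row in table for y in row) - correction
    treat_totals = [sum(row[j] for row in table) for j in range(k)]
    block_totals = [sum(row) for row in table]
    treat_ss = sum(t * t for t in treat_totals) / r - correction
    block_ss = sum(b * b for b in block_totals) / k - correction
    error_ss = total_ss - (treat_ss + block_ss)   # obtained by subtraction

    treat_df, block_df = k - 1, r - 1
    error_df = (k - 1) * (r - 1)
    f_treat = (treat_ss / treat_df) / (error_ss / error_df)
    return {"total_ss": total_ss, "treat_ss": treat_ss, "block_ss": block_ss,
            "error_ss": error_ss, "error_df": error_df, "f_treat": f_treat}


# Hypothetical yields: 3 blocks (rows) by 2 treatments (columns).
result = rcb_anova([[1, 2], [3, 5], [4, 6]])
print(result)
```

The computed F would then be compared with the tabulated value for (k - 1) and (k - 1)(r - 1) degrees of freedom, exactly as in the examples above.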


23.4.4. Randomized Complete Block Design with
Replications Within Blocks. Let us assume that each treatment is
replicated n times within each block, so that each block contains nk
plots. The experiment becomes analogous to two-way classification with
each cell containing n observations. This leads to a new statistical model
of the form

$$Y_{ijm} = \mu + \beta_i + \tau_j + (\beta\tau)_{ij} + e_{ijm}, \qquad i = 1, 2, \dots, r;\; j = 1, 2, \dots, k;\; m = 1, 2, \dots, n,$$

where $(\beta\tau)_{ij}$ denotes the interaction between the ith block and the jth
treatment with the restrictions $\sum_{i=1}^{r} (\beta\tau)_{ij} = 0 = \sum_{j=1}^{k} (\beta\tau)_{ij}$ and other
symbols have their usual significance.

Partitioning of the total sum of squares in this case is given as
follows:

$$\sum_i\sum_j\sum_m (Y_{ijm} - \bar{Y}_{...})^2 = rn\sum_j (\bar{Y}_{.j.} - \bar{Y}_{...})^2 \quad \text{(Treatment SS)}$$

$$+\; kn\sum_i (\bar{Y}_{i..} - \bar{Y}_{...})^2 \quad \text{(Block SS)}$$

$$+\; n\sum_i\sum_j (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2 \quad \text{(Interaction SS)}$$

$$+\; \sum_i\sum_j\sum_m (Y_{ijm} - \bar{Y}_{ij.})^2 \quad \text{(Error SS)}$$

The total number of degrees of freedom is partitioned as below.

Source of Variation              Degrees of freedom
Treatments ($\tau_j$)            k - 1
Blocks ($\beta_i$)               r - 1
Interaction $(\beta\tau)_{ij}$   (k - 1)(r - 1)
Error ($e_{ijm}$)                By subtraction
Total                            rkn - 1

The rest of the computations and analysis are carried out in the
usual manner.


23.4.5. Missing Observations in RCB Design. Sometimes it so
happens that some experimental units or observations are missing. For
example, plants may be destroyed or uprooted, records may be lost,
flasks may break, animals may die or human beings may not cooperate.
These omissions, being beyond the control of the experimenter, destroy
the orthogonality and the balance of the design. In such a case, it is
possible to analyse the non-orthogonal data as a multiple regression, but
this procedure is difficult. The simple method, as suggested by Yates, is
to estimate the value of the missing observation so that a complete set of
data is obtained for carrying out the usual analysis of variance. The
missing observation is estimated by minimizing the error sum of
squares. It has been further shown that the validity of the analysis of
variance of the augmented data is not disturbed if the proportion of the
missing observations is not large and if the error degrees of freedom are
decreased by the number of missing observations computed.


Suppose for convenience, the observation for treatment 1 in block 1
in a RCB design with k treatments and r blocks, is missing and x is put in
place of this missing observation. The data may be arranged as below:


Treatment 
Total 


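Performing the analysis of variance, we get the following sums of
squares:

Source of Variation    Sum of Squares
Treatments             $\frac{1}{r}\left[(T_{.1} + x)^2 + \sum_{j=2}^{k} T_{.j}^2\right] - \frac{(T_{..} + x)^2}{kr}$
Blocks                 $\frac{1}{k}\left[(B_{1.} + x)^2 + \sum_{i=2}^{r} B_{i.}^2\right] - \frac{(T_{..} + x)^2}{kr}$
Error                  By subtraction
Total                  $x^2 + \sum\sum Y_{ij}^2 - \frac{(T_{..} + x)^2}{kr}$

where $T_{.1}$, $B_{1.}$ and $T_{..}$ denote the totals of the observed values for
treatment 1, for block 1 and for the whole experiment respectively.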

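Thus the error sum of squares is obtained as

$$SSE = x^2 + \frac{(T_{..} + x)^2}{kr} - \frac{(T_{.1} + x)^2}{r} - \frac{(B_{1.} + x)^2}{k} + \text{terms which do not involve } x.$$

To minimize SSE, we differentiate it w.r.t. x, equate it to zero
and solve for x. We therefore get

$$\frac{\partial (SSE)}{\partial x} = 0 = 2x + \frac{2(T_{..} + x)}{kr} - \frac{2(T_{.1} + x)}{r} - \frac{2(B_{1.} + x)}{k}$$

$$\text{or}\quad x\left(1 + \frac{1}{kr} - \frac{1}{r} - \frac{1}{k}\right) = \frac{T_{.1}}{r} + \frac{B_{1.}}{k} - \frac{T_{..}}{kr}$$

$$\text{or}\quad x\,\frac{kr - k - r + 1}{kr} = \frac{rB_{1.} + kT_{.1} - T_{..}}{kr}$$

$$\text{or}\quad x = \frac{rB_{1.} + kT_{.1} - T_{..}}{(k - 1)(r - 1)}$$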


The formula in general may be written as

$$x = \frac{rB + kT - G}{(k - 1)(r - 1)},$$

where r = number of blocks or replicates,

k = number of treatments,

B = sum of the remaining values in the block with the missing
observation,

T = sum of the remaining values of the treatment with the missing
observation, and

G = grand total of all the observed values.

The analysis of variance is then completed and the test of
significance is carried out.
If two or more observations are missing, we use the following
iterative procedure proposed by Yates:

(i) Assign guess-estimates such as the mean of the actually observed
values, to all but one missing observations.

(ii) Estimate this one missing observation, using the formula for one
missing observation.

(iii) Estimate then one of the other missing observations, using the
estimate obtained in (ii) and the remaining previously assigned
values.

(iv) Estimate the remaining observations and complete the cycle.

(v) Repeat the procedure using the estimates.

(vi) Continue the procedure until the estimates become stabilized.

(vii) Subtract one degree of freedom for each missing observation
from the total and error degrees of freedom.
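
The iterative procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, missing cells are marked with `None`, every missing cell is started at the mean of the observed values, and the data are an artificial, perfectly additive table (block effect plus treatment effect), for which the estimates should settle on the removed values.

```python
def estimate_single(filled, i, j):
    """One-missing-value formula x = (rB + kT - G) / ((k-1)(r-1)) for cell (i, j)."""
    r, k = len(filled), len(filled[0])
    block_rest = sum(filled[i]) - filled[i][j]                   # B
    treat_rest = sum(row[j] for row in filled) - filled[i][j]    # T
    grand_rest = sum(sum(row) for row in filled) - filled[i][j]  # G
    return (r * block_rest + k * treat_rest - grand_rest) / ((k - 1) * (r - 1))


def yates_iterate(table, tol=1e-9, max_cycles=100):
    """Cycle through the missing cells until the estimates stabilize."""
    missing = [(i, j) for i, row in enumerate(table)
               for j, y in enumerate(row) if y is None]
    observed = [y for row in table for y in row if y is not None]
    start = sum(observed) / len(observed)        # guess-estimate: observed mean
    filled = [[start if y is None else y for y in row] for row in table]
    for _ in range(max_cycles):
        largest_change = 0.0
        for i, j in missing:
            new = estimate_single(filled, i, j)
            largest_change = max(largest_change, abs(new - filled[i][j]))
            filled[i][j] = new
        if largest_change < tol:
            break
    return filled, len(missing)


# Hypothetical additive data: rows are blocks, columns are treatments.
# The removed values were 10 (row 0, column 0) and 15 (row 1, column 2).
data = [[None, 11, 13, 16],
        [12, 13, None, 18],
        [15, 16, 18, 21]]
filled, n_missing = yates_iterate(data)
print(filled[0][0], filled[1][2], n_missing)
```

The count returned as `n_missing` is the number of degrees of freedom to subtract from the total and error lines of the ANOVA table.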


The analysis of variance is then carried out as usual, though this
procedure is not strictly correct. The estimated S.E. of the difference
between the mean of a treatment with a missing observation and the
mean of any other treatment is

$$\sqrt{s_e^2\left[\frac{2}{r} + \frac{k}{r(r - 1)(k - 1)}\right]}.$$


Example 23.4. The following data, obtained from a randomized
complete block design with 3 treatments A, B and C, and 3 blocks
contain one missing observation represented by x.

Treatments          Blocks
              I       II      III
A             5       7       8
B             12      10      16
C             15      14      x

Estimate the missing observation and prepare a table for analysis of
variance.


We use the formula

$$x = \frac{rB + kT - G}{(k - 1)(r - 1)}$$

to estimate the missing observation.

Here B = sum of the values in block III (i.e., block containing
missing observation)

= 8 + 16 = 24;

T = sum of the values of treatment C (i.e., treatment
containing missing observation),

= 15 + 14 = 29; and

G = 5 + 7 + ... + 15 + 14 = 87

Substituting in the formula, we get the estimate as

$$x = \frac{3(24) + 3(29) - 87}{(3 - 1)(3 - 1)} = 18.$$

Putting 18 in place of x, we carry out the necessary computations as
below:
$$\text{Total SS} = \sum_i\sum_j Y_{ij}^2 - \frac{T_{..}^2}{n} = (5)^2 + (7)^2 + \dots + (14)^2 + (18)^2 - \frac{(105)^2}{9}$$

$$= 1383 - 1225 = 158$$

$$\text{Treatment SS} = \sum_j \frac{T_{.j}^2}{r} - \frac{T_{..}^2}{n} = \frac{(20)^2 + (38)^2 + (47)^2}{3} - \frac{(105)^2}{9} = 1351 - 1225 = 126$$

$$\text{Block SS} = \sum_i \frac{B_{i.}^2}{k} - \frac{T_{..}^2}{n} = \frac{(32)^2 + (31)^2 + (42)^2}{3} - \frac{(105)^2}{9} = 1249.7 - 1225 = 24.7$$

Error SS = Total SS - (Treatment SS + Block SS)
= 158 - (126 + 24.7) = 7.3

Hence the ANOVA-Table is

Source of        d.f.    Sum of
Variation                Squares
Treatments       2       126
Blocks           2        24.7
Error            3         7.3
Total            7       158

It is to be noted that total and error degrees of freedom have been
decreased by 1.

Now, we can test $H_0 : \tau_j = 0$ for all j in the customary manner.


23.4.6. Estimation of Missing Observation by Covariance.
The analysis of covariance technique can also be used to estimate the
missing observation. The procedure followed is described below:

(i) Insert zero for the missing value, i.e., set y = 0.

(ii) Introduce a dummy covariate consisting of a one associated with
the missing observation and zeros for all other observations.

(iii) Carry out the covariance analysis.


(iv) The estimate of the missing observation is given by

$$\hat{y} = y_0 - b\,x_0 = -\frac{E_{xy}}{E_{xx}},$$

where $y_0 = 0$, $x_0 = 1$ and $E_{xy}$ and $E_{xx}$ are the error sum of
products and the error sum of squares obtained from the
covariance analysis table.

In the case of two or more missing values, a covariate is introduced for
each missing value and a multiple covariance is computed.


Example 23.5. Estimate the missing observation ($y_0$) from the data given
below, applying the covariance technique:

Treatments                  Blocks
              I (y, x)    II (y, x)    III (y, x)
A             5, 0        7, 0         8, 0
B             12, 0       10, 0        16, 0
C             15, 0       14, 0        0, 1

In order to carry out the analysis of covariance, we insert zero for
the missing observation and pair the observations with the covariate
composed of zeros and a one as shown in the table.

The analysis of covariance table is then set up as below.

Source of      d.f.    Sum of Squares and Products                      Regression
Variation              $\sum x^2$    $\sum xy$                      $\sum y^2$               Co-efficient
Block          2       2/9           24/3 - 9.67 = -1.67            2561/3 - 841 = 13
Treatment      2       2/9           29/3 - 9.67 = 0.00             54
Error          4       4/9           (By diff.) = -8.00             151                      b = -8.00/(4/9) = -18
Total          8       8/9           -9.67                          218

Hence the estimate of the missing observation = -b = 18.

23.4.7. Efficiency of a RCB Design Relative to a CR Design.
An experimental design is said to be more efficient than another if it attains
the same precision with less expenditure of time or money, and its
relative efficiency is usually expressed in terms of a ratio of error
variances. This sort of expression is considered straightforward and
practical as it can be directly interpreted in terms of the amount of
replication required by the design to attain a given precision. As such,
the experimental units may be subjected to the same or dummy
treatments on account of the fact that we are concerned only with errors.
This implies that the treatments may be ignored. The analysis of
variance for the RCB design with k treatments and r replications, is

Source of        d.f.              Sum of Squares           Mean
Variation                                                   Square
Treatments       k - 1             $(k - 1)s_t^2$           $s_t^2$
Blocks           r - 1             $(r - 1)s_b^2$           $s_b^2$
Error            (r - 1)(k - 1)    $(r - 1)(k - 1)s_e^2$    $s_e^2$

For efficiency, we ignore treatments and get

(i) With Blocks:

Source of        d.f.          Sum of Squares
Variation
Blocks           r - 1         $(r - 1)s_b^2$
Error            r(k - 1)      $r(k - 1)s_e^2$

(ii) Without Blocks:

Error SS without blocks = $(r - 1)s_b^2 + r(k - 1)s_e^2$ with (rk - 1) d.f.

$$\text{Estimate of Error Variance} = \frac{(r - 1)s_b^2 + r(k - 1)s_e^2}{rk - 1}$$

Hence the required efficiency = ratio of error variances

$$= \frac{(r - 1)s_b^2 + r(k - 1)s_e^2}{(rk - 1)s_e^2} = \frac{\text{estimated MSE for a CR design}}{\text{MSE for the RCB design}}$$

In this discussion, we have assumed that in either of the designs,
exactly the same experimental units or plots would have been used.

As an example, the relative efficiency (RE) of the RCB design given
in Example 23.3 as compared to a CR design is calculated as

$$RE = \frac{(5 - 1)(5.36) + (5)(4 - 1)(2.19)}{(5 \times 4 - 1)(2.19)} = \frac{21.44 + 32.85}{41.61} = \frac{54.29}{41.61} = 1.30 \text{ or } 130\%$$

This indicates that the RCB design is 130% as efficient as the completely
randomized design.
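
The relative-efficiency calculation above can be checked with a short Python sketch. The function name is hypothetical; the mean squares 5.36 (blocks) and 2.19 (error) are the values quoted from Example 23.3, with r = 5 blocks and k = 4 treatments.

```python
def relative_efficiency_rcb(block_ms, error_ms, r, k):
    """Efficiency of an RCB design relative to a CR design on the same units.

    The error variance a CR design would have had is estimated as
    ((r - 1) * block_ms + r * (k - 1) * error_ms) / (r * k - 1).
    """
    cr_error_ms = ((r - 1) * block_ms + r * (k - 1) * error_ms) / (r * k - 1)
    return cr_error_ms / error_ms


re_example = relative_efficiency_rcb(block_ms=5.36, error_ms=2.19, r=5, k=4)
print(f"RE = {re_example:.2f}")
```

A value above 1 indicates that blocking reduced the error variance relative to complete randomization.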


23.5 THE LATIN SQUARE DESIGN

The experimental error in RCB design is reduced by controlling the
source of extraneous variation in one direction, i.e., by grouping the
experimental units in one way. When the two sources of variation are in two
directions, it becomes necessary to remove these two sources of variation
simultaneously. This end is achieved by simultaneous blocking of
experimental units in two mutually perpendicular directions called rows
and columns. Since each row and each column is a complete block, the
grouping for a balanced arrangement is made by imposing the
restriction that each treatment must appear once and only once in each
row and once and only once in each column. Thus if there are k treatments,
the experimental area will be divided into k rows and k columns,
resulting in $k^2$ plots or experimental units, as the experiment is laid out
in a square pattern. The treatments are then assigned at random to the plots
or experimental units. Such a double blocking of experimental units and
a corresponding doubly restricted random assignment of treatments is called
a Latin Square (LS) design, following Euler who used Latin letters for symbols
or treatments.

Hence a Latin Square (LS) design is an arrangement of k
treatments in a k x k square, where the treatments are grouped in
blocks in two directions, the directions being orthogonal to each other
and to the treatments, and where the treatments appear once and only
once in each direction. It should be noted that in a Latin Square design
the number of rows, the number of columns and the number of
treatments must all be equal.

23.5.1. Construction and Layout. Latin Squares are always
constructed by rotation, e.g. in the case of 4 treatments A, B, C and D,
we get

A  B  C  D
B  C  D  A
C  D  A  B
D  A  B  C

A large number of distinct Latin squares can be derived by
interchanging rows and columns of a standard Latin square, where a
standard Latin square is a Latin Square in which the treatments in the
first row and in the first column are arranged in alphabetical or
numerical order. A standard Latin square is also called a reduced Latin
square. There is only one standard square for a 2 x 2 Latin square, viz.,

A  B
B  A

For a 3 x 3 Latin square, there is only one standard square

A  B  C
B  C  A
C  A  B

But for a 4 x 4 Latin square, there are four possible standard squares.

The possible standard forms of a 5 x 5 Latin square are 56 and so
on. Each standard square of size k x k yields k! (k - 1)! different squares
by permuting the rows except the first and all the k columns.

Accordingly we may obtain from a square of size
(i) 2 x 2, 2 different squares;
(ii) 3 x 3, 12 different squares;
(iii) 4 x 4, 576 different squares;
(iv) 5 x 5, 161,280 different squares; etc.

In designing a Latin square, one square is to be selected at random
from all the possible forms of the k x k Latin squares and thereafter, the
rows and the columns are randomized for the purpose of field layout.

It is conceptually desirable that a Latin square design should occupy
a square experimental area, but in certain situations the design may be
laid out otherwise, if the purpose of the LS design, i.e., the control of
variation in two directions, is achieved.
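
The construction by rotation and the subsequent randomization of rows and columns can be sketched in Python. This is only a sketch with hypothetical function names; note that shuffling the rows and columns of one cyclic square is a simplification and does not select uniformly from all possible k x k Latin squares, which is what the text recommends.

```python
import random
from math import factorial


def cyclic_latin_square(treatments):
    """Build a Latin square by rotating the treatment list one step per row."""
    k = len(treatments)
    return [[treatments[(i + j) % k] for j in range(k)] for i in range(k)]


def is_latin_square(square):
    """Each treatment must appear once and only once in each row and column."""
    k = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == k and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(k)} == symbols for j in range(k))
    return len(symbols) == k and rows_ok and cols_ok


def randomize_layout(square, rng):
    """Randomly permute the rows and the columns for the field layout."""
    k = len(square)
    row_order = rng.sample(range(k), k)
    col_order = rng.sample(range(k), k)
    return [[square[i][j] for j in col_order] for i in row_order]


def squares_from_standard(k, n_standard):
    """Number of squares obtainable: n_standard * k! * (k - 1)!."""
    return n_standard * factorial(k) * factorial(k - 1)


square = cyclic_latin_square(list("ABCDE"))
layout = randomize_layout(square, random.Random(0))  # seed chosen arbitrarily
for row in layout:
    print(" ".join(row))
```

Permuting rows and columns preserves the Latin property, which is why `is_latin_square(layout)` holds for any seed.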
23.5.2. Statistical Model and Analysis. The data obtained from
a k x k Latin Square design may be arranged in tabular form, with the
rows and columns of the square as the rows and columns of the table.

The treatment means may be represented by $\bar{Y}_{..1}, \bar{Y}_{..2}, \dots, \bar{Y}_{..k}$.

The yield $Y_{ij(h)}$ of the ijth observation in a LS design with one
observation per plot or experimental unit, may be represented by the
linear statistical model

$$Y_{ij(h)} = \mu + R_i + C_j + T_h + e_{ij(h)}, \qquad i, j, h = 1, 2, \dots, k,$$

where $\mu$, $R_i$, $C_j$, $T_h$ and $e_{ij(h)}$ denote the general mean, the effect of the
ith row, the effect of the jth column, the effect of the hth treatment and the
experimental error respectively; and where $e_{ij(h)}$ are assumed to be
normally and independently distributed with mean zero and common
variance $\sigma^2$. In this model, the interaction terms are assumed to be zero
and the subscript h being not independent of i and j, is placed within
brackets. To test the hypotheses that


(i) all treatment-effects are zero;
(ii) all row-effects are zero; and
(iii) all column-effects are zero,

we find the least-squares estimates of $\mu$, $R_i$, $C_j$ and $T_h$ by minimising the
quantity

$$Q = \sum\sum e_{ij(h)}^2$$

subject to the restrictions that

$$\sum_i R_i = 0, \quad \sum_j C_j = 0 \quad \text{and} \quad \sum_h T_h = 0.$$

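For partitioning the total sum of squares into components for
rows, columns, treatments and error, let us construct the following
identity:

$$Y_{ij(h)} - \bar{Y}_{...} = (\bar{Y}_{i..} - \bar{Y}_{...}) \quad \text{(Row term)}$$

$$+\; (\bar{Y}_{.j.} - \bar{Y}_{...}) \quad \text{(Column term)}$$

$$+\; (\bar{Y}_{..h} - \bar{Y}_{...}) \quad \text{(Treatment term)}$$

$$+\; (Y_{ij(h)} - \bar{Y}_{i..} - \bar{Y}_{.j.} - \bar{Y}_{..h} + 2\bar{Y}_{...}) \quad \text{(Error term)}$$

Squaring both sides of the identity and summing over all values, we
get

$$\sum\sum (Y_{ij(h)} - \bar{Y}_{...})^2 = k\sum_i (\bar{Y}_{i..} - \bar{Y}_{...})^2 + k\sum_j (\bar{Y}_{.j.} - \bar{Y}_{...})^2 + k\sum_h (\bar{Y}_{..h} - \bar{Y}_{...})^2$$

$$+\; \sum\sum \left[Y_{ij(h)} - \bar{Y}_{i..} - \bar{Y}_{.j.} - \bar{Y}_{..h} + 2\bar{Y}_{...}\right]^2$$

+ cross-product terms which vanish, when summed.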


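This sum-of-squares identity may give rise to some confusion with
respect to the summation process over i, j and h. The confusion
disappears by an intelligent understanding of the following relations:

$$\sum_i\sum_j (Y_{ij(h)} - \bar{Y}_{...})^2 = \sum_i\sum_h (Y_{ij(h)} - \bar{Y}_{...})^2 = \sum_j\sum_h (Y_{ij(h)} - \bar{Y}_{...})^2 = \sum (Y_{ij(h)} - \bar{Y}_{...})^2$$

Denoting the row, column, treatment and grand totals by $R_i$, $C_j$, $T_h$
and G respectively, the various sums of squares terms are simplified as

$$\text{Total SS} = \sum\sum (Y_{ij(h)} - \bar{Y}_{...})^2 = \sum\sum Y_{ij(h)}^2 - \frac{G^2}{k^2}$$

$$\text{Rows SS} = k\sum_i (\bar{Y}_{i..} - \bar{Y}_{...})^2 = \sum_i \frac{R_i^2}{k} - \frac{G^2}{k^2}$$

$$\text{Columns SS} = k\sum_j (\bar{Y}_{.j.} - \bar{Y}_{...})^2 = \sum_j \frac{C_j^2}{k} - \frac{G^2}{k^2}$$

$$\text{Treatments SS} = k\sum_h (\bar{Y}_{..h} - \bar{Y}_{...})^2 = \sum_h \frac{T_h^2}{k} - \frac{G^2}{k^2}$$

and the sum of squares for experimental error may be obtained by
subtraction.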
The resulting analysis of variance is shown in the following table:

ANOVA-Table for a k x k Latin Square Design

Source of        d.f.              Sum of Squares                                 Mean Square
Variation
Rows             k - 1             $\sum_i R_i^2/k - G^2/k^2$                     $s_r^2$
Columns          k - 1             $\sum_j C_j^2/k - G^2/k^2$                     $s_c^2$
Treatments       k - 1             $\sum_h T_h^2/k - G^2/k^2$                     $s_t^2$
Error            (k - 1)(k - 2)    By subtraction                                 $s_e^2$
Total            $k^2 - 1$         $\sum\sum Y_{ij(h)}^2 - G^2/k^2$

Hence to test the hypothesis that treatment means are equal, we
compute the ratio

$$F = \frac{s_t^2}{s_e^2} = \frac{\text{MS for Treatments}}{\text{MS for Error}}$$

which conforms to an F-distribution with $\nu_1 = k - 1$, $\nu_2 = (k - 1)(k - 2)$
d.f. if the null hypothesis is true. The hypothesis would be rejected if

$$F \ge F_{\alpha}(\nu_1, \nu_2).$$

In a similar way, we can proceed to test the hypothesis that all row-
effects or column-effects, are zero.


The least significant difference (LSD) to test the difference between
two treatment means selected at random, is given by

$$\mathrm{LSD} = t_{\alpha/2}\sqrt{\frac{2 s_e^2}{k}},$$

where k is the number of treatments (and hence the number of replications
of each treatment).

23.5.3. Advantages and Disadvantages. A Latin square design
has the following advantages and disadvantages:

(i) A Latin square design reduces the error variance by controlling
the variation in two directions.

(ii) The analysis of a LS design is simple and remains relatively
simple with missing observations.

(iii) A Latin Square design is generally more efficient than a
randomized complete block design.

(iv) A Latin square design is less flexible than a RCB design. It is
practical only for 5 to 10 treatments. When the number of
treatments exceeds 10, the design is seldom used.

(v) For a small number of treatments, a LS design does not provide a
sufficient number of replicates to give a valid estimate of error.

(vi) Replication in a Latin square design is costly.

(vii) In agricultural experimentation, the land requirement is rigid,
the actual layout in the field may be laborious and approach to
most plots becomes difficult.

Example 23.6. Five fertilizers A, B, C, D and E were tested for
growing plants in a Latin square design in a field. The rows and
columns in the table are rows and columns in the field. The yields in
bushels per plot are as below:

(i) We state our hypotheses as
$H_0$ : The mean yields for all the five fertilizers are equal, and
$H_1$ : Not all mean yields for the fertilizers are equal.

(ii) The level of significance is set at $\alpha = 0.05$.

(iii) The test-statistic to use is

$$F = \frac{\text{MS for fertilizers}}{\text{MS for error}}$$

which, if $H_0$ is true, has an F-distribution with $\nu_1 = k - 1$ and
$\nu_2 = (k - 1)(k - 2)$ degrees of freedom.

(iv) Computations.

Summary for Fertilizers (Treatments)
(Total for a fertilizer or treatment is obtained by adding all the
values of the fertilizer or treatment in the square).

Fertilizer    A        B        C        D        E
Total         34.2     32.3     65.6     39.8     24.6
Mean          6.84     6.46     13.12    7.96     4.92
$$\text{Now, Total SS } (S_{yy}) = \sum\sum Y_{ij(h)}^2 - \frac{G^2}{k^2}$$

$$= (4.9)^2 + (6.4)^2 + \dots + (15.9)^2 + (7.6)^2 - \frac{(196.5)^2}{25}$$

$$= 1829.83 - 1544.49 = 285.34$$

$$\text{Rows SS } (R_{yy}) = \sum_i \frac{R_i^2}{k} - \frac{G^2}{k^2} = \frac{7955.79}{5} - 1544.49$$

$$= 1591.16 - 1544.49 = 46.67$$

$$\text{Columns SS } (C_{yy}) = \sum_j \frac{C_j^2}{k} - \frac{G^2}{k^2} = \frac{7792.55}{5} - 1544.49$$

$$= 1558.51 - 1544.49 = 14.02$$

$$\text{Fertilizers SS } (T_{yy}) = \sum_h \frac{T_h^2}{k} - \frac{G^2}{k^2} = \frac{(34.2)^2 + (32.3)^2 + \dots + (24.6)^2}{5} - \frac{(196.5)^2}{25}$$

$$= 1741.10 - 1544.49 = 196.61$$

$$\text{Error SS } (E_{yy}) = S_{yy} - (R_{yy} + C_{yy} + T_{yy})$$

$$= 285.34 - (46.67 + 14.02 + 196.61) = 28.04$$
These results are arranged in the following analysis of variance table:

Source of        d.f.    Sum of      Mean       Computed
Variation                Squares     Square     F
Rows             4        46.67      11.67
Columns          4        14.02       3.51
Fertilizers      4       196.61      49.15      21.00
Error            12       28.04       2.34
Total            24      285.34

(v) The critical region is $F \ge F_{0.05}(4, 12) = 3.26$.

(vi) Conclusion. Since the computed value of F = 21.00 falls in the
critical region, we therefore reject our null hypothesis. Hence we
conclude that the data provide sufficient evidence to indicate at
the 5% significance level that the mean yields for the fertilizers
are not equal.


Next, we compute the least significant difference (LSD) to
determine which fertilizer is to be considered the best one. Thus

$$\mathrm{LSD} = t_{0.025,(12)}\sqrt{\frac{2\,(MSE)}{k}} = 2.18\sqrt{\frac{2(2.34)}{5}} = (2.18)(0.96) = 2.09 \text{ bushels}$$

Arranging the mean yields for the five fertilizers in ascending order and
drawing a line under any subset of adjacent mean yields that do not
differ significantly, we get

E       B       A       D       C
4.92    6.46    6.84    7.96    13.12

The fertilizer C is the best one as the mean yield for fertilizer C
differs significantly from the mean yields of all other fertilizers.
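
The fertilizer sum of squares and the F-ratio of Example 23.6 can be re-derived from the quoted totals with a short Python sketch. The function name is hypothetical; the total, row and column sums of squares are taken as printed in the example because the individual plot yields are not reproduced here.

```python
def latin_square_f(total_ss, rows_ss, cols_ss, treatment_totals, grand_total):
    """Treatment SS, error SS and F-ratio for a k x k Latin square."""
    k = len(treatment_totals)
    correction = grand_total ** 2 / k ** 2
    treat_ss = sum(t * t for t in treatment_totals) / k - correction
    error_ss = total_ss - (rows_ss + cols_ss + treat_ss)   # by subtraction
    treat_df = k - 1
    error_df = (k - 1) * (k - 2)
    f_ratio = (treat_ss / treat_df) / (error_ss / error_df)
    return treat_ss, error_ss, f_ratio


# Fertilizer totals A, B, C, D, E and the sums of squares quoted in the example.
totals = [34.2, 32.3, 65.6, 39.8, 24.6]
treat_ss, error_ss, f_ratio = latin_square_f(
    total_ss=285.34, rows_ss=46.67, cols_ss=14.02,
    treatment_totals=totals, grand_total=sum(totals))
print(round(treat_ss, 2), round(error_ss, 2), round(f_ratio, 2))
```

Because the sketch does not round the mean squares before dividing, its F-ratio may differ slightly in the second decimal place from the hand computation.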
i PANAR 
-e e. Tne forms 
23.5.4. Missing Observations in a Latin Square. The formula
for a missing observation in a k x k Latin square is developed as follows.
For convenience, we assume that the observation in the first row,
the first column and for treatment one, is missing and represent it by x.
Let $R_1'$, $C_1'$ and $T_1'$ represent the totals of observed values in the first
row, of the first column and for treatment one respectively. The total for
all the observed values is denoted by $G'$. The totals of other rows,
columns and treatments are denoted by $R_i$, $C_j$ and $T_h$ respectively,
where i, j, h = 2, 3, ..., k.

To find the best estimate of the missing observation, we minimize the
error sum of squares. With the above assumptions, we may write the
various sums of squares as follows:

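Source of        Sum of Squares
Variation
Rows             $R_{yy} = \frac{1}{k}\left[(R_1' + x)^2 + \sum_{i=2}^{k} R_i^2\right] - \frac{(G' + x)^2}{k^2}$
Columns          $C_{yy} = \frac{1}{k}\left[(C_1' + x)^2 + \sum_{j=2}^{k} C_j^2\right] - \frac{(G' + x)^2}{k^2}$
Treatments       $T_{yy} = \frac{1}{k}\left[(T_1' + x)^2 + \sum_{h=2}^{k} T_h^2\right] - \frac{(G' + x)^2}{k^2}$
Error            $E_{yy}$ = By subtraction
Total            $S_{yy} = x^2 + \sum\sum Y_{ij(h)}^2 - \frac{(G' + x)^2}{k^2}$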

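The error sum of squares is then obtained as

$$E_{yy} = S_{yy} - (R_{yy} + C_{yy} + T_{yy})$$

$$= x^2 - \frac{1}{k}\left[(R_1' + x)^2 + (C_1' + x)^2 + (T_1' + x)^2\right] + \frac{2(G' + x)^2}{k^2} + \text{terms not involving } x.$$

Differentiating $E_{yy}$ w.r.t. x, equating the derivative to zero, and
solving for x, we get

$$0 = 2x - \frac{2}{k}\left[(R_1' + x) + (C_1' + x) + (T_1' + x)\right] + \frac{4(G' + x)}{k^2}$$

$$\text{or}\quad x\left(1 - \frac{3}{k} + \frac{2}{k^2}\right) = \frac{1}{k}(R_1' + C_1' + T_1') - \frac{2G'}{k^2}$$

$$\text{or}\quad x\,\frac{k^2 - 3k + 2}{k^2} = \frac{k(R_1' + C_1' + T_1') - 2G'}{k^2}$$

$$\text{Hence}\quad x = \frac{k(R_1' + C_1' + T_1') - 2G'}{(k - 1)(k - 2)} \qquad \left(\because\; k^2 - 3k + 2 = (k - 1)(k - 2)\right)$$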

When the missing observation corresponds to the ith row, jth
column and the hth treatment, the formula holds and may be written as

$$x = \frac{k(R_i + C_j + T_h) - 2G}{(k - 1)(k - 2)}$$

We substitute the value of x and carry out the analysis of
variance in the usual way, excepting that we reduce the degrees of
freedom associated with Error and Total by 1. The formula for the
S.E. of the difference between a treatment mean and the mean
of a treatment with a missing observation becomes

$$\sqrt{s_e^2\left[\frac{2}{k} + \frac{1}{(k - 1)(k - 2)}\right]}.$$

If two or more observations are missing, the formula for a
single missing observation is repeated exactly in the same way as we did
in the randomised complete block design, subtracting 1 from the degrees
of freedom associated with error and total for each missing observation.
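
The missing-value formula for a Latin square can be sketched and checked on an artificial example. Everything in the data below is hypothetical: a 3 x 3 square built by rotation whose yields are exactly row effect + column effect + treatment effect, so the formula should reproduce the value that was removed.

```python
def latin_square_missing(k, row_total, col_total, treat_total, grand_total):
    """x = [k(R + C + T) - 2G] / [(k - 1)(k - 2)], totals of observed values only."""
    return (k * (row_total + col_total + treat_total) - 2 * grand_total) / (
        (k - 1) * (k - 2))


# Hypothetical additive 3 x 3 Latin square; treatment index of cell (i, j) is (i + j) % k.
k = 3
row_eff, col_eff, trt_eff = [0, 1, 2], [0, 10, 20], [0, 100, 200]
cells = {(i, j): ((i + j) % k, row_eff[i] + col_eff[j] + trt_eff[(i + j) % k])
         for i in range(k) for j in range(k)}

mi, mj = 1, 2                              # position of the "missing" plot
mh, true_value = cells[(mi, mj)]           # its treatment and removed yield
observed = {pos: cell for pos, cell in cells.items() if pos != (mi, mj)}

R = sum(y for (i, j), (h, y) in observed.items() if i == mi)
C = sum(y for (i, j), (h, y) in observed.items() if j == mj)
T = sum(y for (i, j), (h, y) in observed.items() if h == mh)
G = sum(y for (h, y) in observed.values())

estimate = latin_square_missing(k, R, C, T, G)
print(estimate, true_value)
```

With real data the estimate will not equal the lost value, of course; it is the value that minimizes the error sum of squares.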


23.5.5 Efficiency of Latin Squares. Let $s_r^2$, $s_c^2$ and $s_e^2$ denote the
mean squares for rows, columns and error in a k x k Latin square. Then
the efficiency of a LS design relative to a RCB design is estimated as

(i) taking rows as blocks, by the formula

$$\text{Efficiency} = \frac{s_c^2 + (k - 1)s_e^2}{k\,s_e^2} = \frac{1}{k}\left[\frac{s_c^2}{s_e^2} + (k - 1)\right]$$

(ii) taking columns as blocks, by the formula

$$\text{Efficiency} = \frac{s_r^2 + (k - 1)s_e^2}{k\,s_e^2} = \frac{1}{k}\left[\frac{s_r^2}{s_e^2} + (k - 1)\right]$$

To evaluate the efficiency of a LS design relative to a completely
randomised (CR) design, the appropriate formula is

$$\text{Efficiency} = \frac{s_r^2 + s_c^2 + (k - 1)s_e^2}{(k + 1)\,s_e^2}$$
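
The three efficiency formulas for a Latin square can be sketched as follows. The function names are hypothetical, and the mean squares used (11.67 for rows, 3.51 for columns, 2.34 for error, with k = 5) are illustrative values of roughly the size implied by the sums of squares in Example 23.6; they are not figures printed in the text.

```python
def ls_vs_rcb(other_direction_ms, error_ms, k):
    """Efficiency of a LS design relative to an RCB design.

    other_direction_ms is the mean square of the blocking direction that the
    RCB design would NOT control (columns if rows are taken as blocks, and
    rows if columns are taken as blocks).
    """
    return (other_direction_ms + (k - 1) * error_ms) / (k * error_ms)


def ls_vs_cr(row_ms, col_ms, error_ms, k):
    """Efficiency of a LS design relative to a completely randomised design."""
    return (row_ms + col_ms + (k - 1) * error_ms) / ((k + 1) * error_ms)


row_ms, col_ms, error_ms, k = 11.67, 3.51, 2.34, 5   # illustrative values
rows_as_blocks = ls_vs_rcb(col_ms, error_ms, k)
cols_as_blocks = ls_vs_rcb(row_ms, error_ms, k)
versus_cr = ls_vs_cr(row_ms, col_ms, error_ms, k)
print(round(rows_as_blocks, 2), round(cols_as_blocks, 2), round(versus_cr, 2))
```

The larger the mean square of the direction that would be left uncontrolled, the greater the gain from using the Latin square.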

23.5.6. Orthogonal Latin Squares and Graeco-Latin Square.
Two Latin squares are said to be orthogonal if each letter of one square
occurs exactly once with every letter of the other square when they are
superimposed. If we write one of the two orthogonal squares with Latin
letters and the other with Greek letters, then on superimposing the two
squares, we get a square in which each Latin or Greek letter
appears exactly once in each row and in each column, and each
Latin letter appears exactly once with each Greek letter. Such a square
(or design) is called a Graeco-Latin Square (G-LS). For example, if we
have two orthogonal Latin squares, one with Latin letters and the other
with Greek letters,

A  B  C  D          α  β  γ  δ
B  A  D  C          γ  δ  α  β
C  D  A  B          δ  γ  β  α
D  C  B  A          β  α  δ  γ

then, by superimposing, we get the following 4 x 4 Graeco-Latin square

Aα  Bβ  Cγ  Dδ
Bγ  Aδ  Dα  Cβ
Cδ  Dγ  Aβ  Bα
Dβ  Cα  Bδ  Aγ

The analysis of variance appropriate to this design is given below:


Source of            d.f.              Sum of Squares
Variation
Rows                 k - 1             $\sum_i R_i^2/k - G^2/k^2$
Columns              k - 1             $\sum_j C_j^2/k - G^2/k^2$
Latin Letters        k - 1             $\sum_h T_h^2/k - G^2/k^2$
(Treatments)
Greek Letters        k - 1             $\sum_l Q_l^2/k - G^2/k^2$
Error                (k - 1)(k - 3)    By subtraction
Total                $k^2 - 1$         $\sum\sum Y_{ij(hl)}^2 - G^2/k^2$

where $R_i$, $C_j$, $T_h$ and $Q_l$ are the totals for the ith row, jth column, hth
Latin letter (Treatment), lth Greek letter and G stands for the grand
total. The assumption necessary for the analysis of a k x k G-LS design
is that the observations may be represented by the linear statistical
model

$$Y_{ij(hl)} = \mu + R_i + C_j + T_h + Q_l + e_{ij(hl)}, \qquad i, j, h, l = 1, 2, \dots, k,$$

where $\sum R_i = \sum C_j = \sum T_h = \sum Q_l = 0$, and where $e_{ij(hl)}$ are
independently and normally distributed with zero mean and common
variance $\sigma^2$.


A G-LS design is used when we desire to control three sources of
variation, one source of variation being controlled by rows, the other
source by columns and the third source by Greek letters. The degrees of
freedom associated with error would be inadequate when k is less than 4.
It has been shown that a 6 x 6 Graeco-Latin square does not exist.
Cochran and Cox have given the layouts of the Graeco-Latin Square
for all numbers of treatments from 3 to 12 with the exception of 6
and 10, in their book Experimental Designs.

It should be noted that no more than (k - 1) Latin squares of order
k can be orthogonal. A set of (k - 1) orthogonal squares of order k, is
called a complete set of orthogonal Latin squares. Such complete sets
can be obtained whenever k is a prime number or the power of a prime
number.
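
The definition of orthogonality can be checked mechanically: two squares are orthogonal when superimposing them produces every (Latin letter, Greek letter) pair exactly once. The sketch below uses hypothetical function names and the two 4 x 4 squares that make up the Graeco-Latin square shown earlier in this section.

```python
def are_orthogonal(square_1, square_2):
    """True if every pair of superimposed symbols occurs exactly once."""
    k = len(square_1)
    pairs = {(square_1[i][j], square_2[i][j])
             for i in range(k) for j in range(k)}
    return len(pairs) == k * k


def graeco_latin_error_df(k):
    """Error d.f. after removing rows, columns, Latin and Greek letters."""
    return (k * k - 1) - 4 * (k - 1)       # equals (k - 1)(k - 3)


latin = ["ABCD", "BADC", "CDAB", "DCBA"]
greek = ["αβγδ", "γδαβ", "δγβα", "βαδγ"]

print(are_orthogonal(latin, greek))        # the pair used in the text
print(are_orthogonal(latin, latin))        # a square is never orthogonal to itself
print([graeco_latin_error_df(k) for k in (3, 4, 5)])
```

The error degrees of freedom shrink quickly for small k, which is the practical limitation the text notes for small Graeco-Latin squares.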


23.6 SINGLE DEGREE OF FREEDOM CONTRASTS

In many experiments, we may desire to partition the Sum of
Squares for Treatments into a number of components, each with one
degree of freedom. Let $Q_j$ be any contrast among the treatment totals T's. Then
the sum of squares for this contrast is computed by

$$Q_j\ \text{SS} = \frac{(\text{contrast})^2}{r\sum (\text{contrast coefficient})^2} = \frac{Q_j^2}{r\sum_i c_{ij}^2},$$

where c's are constant co-efficients of the treatment totals in the jth
contrast and r is the number of observations in each treatment total.

$Q_j^2 / (r\sum_i c_{ij}^2)$ is a component of the sum of squares for Treatment and
represents 1 degree of freedom. The Treatment Sum of Squares with
(k - 1) degrees of freedom may thus be partitioned into (k - 1) independent
tests, each based on 1 degree of freedom. Thus

$$\text{Treatment SS} = \frac{Q_1^2}{r\sum_i c_{i1}^2} + \frac{Q_2^2}{r\sum_i c_{i2}^2} + \dots + \frac{Q_{k-1}^2}{r\sum_i c_{i(k-1)}^2}$$
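
The partition of the Treatment SS into single-degree-of-freedom components can be illustrated with a short Python sketch. The function names and the treatment totals are hypothetical; the two contrasts used are mutually orthogonal, which is what makes their sums of squares add up to the Treatment SS.

```python
def contrast_ss(totals, coefficients, r):
    """Sum of squares Q^2 / (r * sum(c^2)) for one contrast among treatment totals."""
    if sum(coefficients) != 0:
        raise ValueError("contrast coefficients must sum to zero")
    q = sum(c * t for c, t in zip(coefficients, totals))
    return q * q / (r * sum(c * c for c in coefficients))


def treatment_ss(totals, r):
    """Usual Treatment SS from totals, each based on r observations."""
    k = len(totals)
    return sum(t * t for t in totals) / r - sum(totals) ** 2 / (r * k)


# Hypothetical totals of k = 3 treatments, each from r = 2 observations.
totals, r = [10, 14, 18], 2
ss_1 = contrast_ss(totals, [2, -1, -1], r)   # T1 versus the mean of T2 and T3
ss_2 = contrast_ss(totals, [0, 1, -1], r)    # T2 versus T3
print(ss_1, ss_2, treatment_ss(totals, r))
```

Each component has one degree of freedom, so its sum of squares is also its mean square and can be divided by the error mean square for an F-test.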

454 INTRODUCTION TO STATISTICA y i pRIMENTAL e 

E E 4 le. Suppose we wi o ee Main Effects and Interaction Effects, T 
Let us consider an example pp Wish to Compare 93.7-1- among the treatment combinations are called the Eff ae 

; ; ; ; ects, 
the mean of T, and T3. Then mi 60 arisa epresented by capitals A, B,C, ete. These effects are 
Qı = 27, — (T + T3) T =p tal which a of main a ent phoma A main effect of a factor is 

r J~ +0 osi r the average change ; 7 
f re for this contrast is ty m das a measure 0 It j E ge In effect produced by 
and the sum of squa define the level of the factor. It is measured independently of other 
2 pang! ; ffect of the factor only, Factors are said to i 
= 2 c s the e : i 0 interact, 
2 - LT,- (T; + T9]? factors oe I are not independent. But Interaction in a factorial 
rYer r [22 + 2(-1)2] a ett is a measure of the extent to which the effect of changing 
exper! 


Similarly, for a contrast defined by 
Qe = 2(T, + T3 + T3) -3 (Ty + Ty) 
the sum of squares will be 
2 
[ot + T, + T3) - 3(T, + T;)]2 
——— = a OT 
rQev r [3(2)2 + 2(-3)2] 


The mean squares of these contrasts are taken as the numerator for f 


an F-test to investigate whether the effects estimated by these contrasts 
are zero. `“ 


23.7 FACTORIAL EXPERIMENTS 


Experiments are often planned to investigate the effects of Say, 
different rates of ‘fertilizers, different dates of planting, different 
categories of education, different intensities of. a stimulus, ete 
Technically Speaking the independent variables such as fertilizer, 
planting, education, stimulus, etc. are called factors, while the values 
such as rates, dates, categories or intensities at which a factor is held 
fixed, are known as levels. It is customary to represent the factors by 
small letters a, b, c, etc, and a particular level by small letter with à 
subscript a,, bj, cp, ete. A treatment is then determined by a combination 
of different levels of factors 4% by Chs ..., etc. For example, if we have W 
‘factors a and.b, each at 2 levels zero and one (i.e., i = 0, 1; j = 0, 1), then 
the four treatment combinations are © 


agh, abo agb; and abı. 
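An experiment is called a factorial experiment if the treatments consist of all possible combinations of the levels of the factors. When each treatment combination is used the same number of times, the factorial experiment is known as a complete factorial experiment. A factorial experiment is described as a $p^n$ factorial experiment if there are n factors and each factor is considered at p levels (the number of factors being the exponent). Thus a $2^2$ or 2x2 factorial experiment means 2 factors each at 2 levels.

A main effect measures the effect of the factor only. Factors are said to interact when they are not independent. Interaction in a factorial experiment is a measure of the extent to which the effect of changing the levels of one or more factors depends on the levels of the other factors. Interactions between two factors are referred to as first order interactions, those concerning three factors as second order interactions, and so on. A main effect is sometimes regarded as an interaction of zero order. These concepts may also be expressed in symbols.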
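As a small illustration of the $p^n$ notation, the following Python sketch lists every treatment combination for a given set of factors. The helper name `treatment_combinations` is hypothetical; the labels follow the $a_i b_j$ convention described above.

```python
from itertools import product


def treatment_combinations(factors, levels):
    """All treatment combinations of a p^n factorial.

    factors -- factor letters, e.g. "ab"
    levels  -- number of levels p of every factor
    """
    combos = []
    for idx in product(range(levels), repeat=len(factors)):
        # product() changes the last position fastest; reverse the tuple so
        # that the first factor changes fastest, as in a0b0, a1b0, a0b1, a1b1
        idx = idx[::-1]
        combos.append("".join(f"{f}{i}" for f, i in zip(factors, idx)))
    return combos


print(treatment_combinations("ab", 2))        # the 2^2 case
print(len(treatment_combinations("abc", 3)))  # a 3^3 factorial
```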


23.7.2. Effects in a $2^2$-Factorial Experiment. Let the symbols $a_i b_j$ ($i = 0, 1$; $j = 0, 1$) represent both the treatment combinations and the yields from all experimental units or plots. Then the main effect of the factor a is determined as below:

Effect of factor a at level $b_0$ of factor b $= a_1 b_0 - a_0 b_0$
Effect of factor a at level $b_1$ of factor b $= a_1 b_1 - a_0 b_1$

Main effect of factor a = Average change produced by varying factor a
$$= \frac{1}{2}[(a_1 b_0 - a_0 b_0) + (a_1 b_1 - a_0 b_1)] = \frac{1}{2}(a_1 - a_0)(b_1 + b_0)$$

Similarly, the main effect of factor b
$$= \frac{1}{2}[(a_0 b_1 - a_0 b_0) + (a_1 b_1 - a_1 b_0)] = \frac{1}{2}(a_1 + a_0)(b_1 - b_0)$$

If the two factors a and b were acting independently, the effect of a at $b_0$ and the effect of a at $b_1$, or the effect of b at $a_0$ and the effect of b at $a_1$, should be equal, but, in general, they will be different. This difference is a measure of the extent to which the factors interact. Hence $A \times B$, the interaction between two factors a and b, each at two levels zero and 1, is given by
$$A \times B = \frac{1}{2}[(a_1 b_1 - a_0 b_1) - (a_1 b_0 - a_0 b_0)] = \frac{1}{2}(a_1 - a_0)(b_1 - b_0)$$


It is clear from this relation that the interaction between factors a and b, i.e. AB, is the same as that between b and a, i.e. BA.

The overall mean is represented by M and is the average of all the yields, i.e.,
$$M = \frac{1}{4}(a_0 b_0 + a_1 b_0 + a_0 b_1 + a_1 b_1) = \frac{1}{4}(a_0 + a_1)(b_0 + b_1)$$

Replacing the symbols $a_0$ and $b_0$ by 1, and the symbols $a_1$ and $b_1$ by a and b (i.e., writing (1) for $a_0 b_0$, a for $a_1 b_0$, b for $a_0 b_1$ and ab for $a_1 b_1$), the preceding comparisons may be expressed as
$$A = \frac{1}{2}(a - 1)(b + 1)$$
$$B = \frac{1}{2}(a + 1)(b - 1)$$
$$AB = \frac{1}{2}(a - 1)(b - 1)$$
$$M = \frac{1}{4}(a + 1)(b + 1)$$


These effects can be conveniently written in a table of plus and minus signs as below:

Effect      Treatment Combinations      Divisor
            (1)    a     b     ab
M            +     +     +     +           4
A            -     +     -     +           2
B            -     -     +     +           2
AB           +     -     -     +           2
It should be noted that the effects A, B and AB are 3 mutually orthogonal contrasts of the yields of the 4 treatments, each based on 1 degree of freedom. Further, the treatment combinations written as (1), a, b, ab are referred to as being in standard order (form).

23.7.3. Effects in a $2^3$-Factorial Experiment. In this case, we consider 3 factors a, b and c, each at 2 levels. Representing the treatment combinations by the symbols $a_i b_j c_k$ ($i = 0, 1$; $j = 0, 1$; $k = 0, 1$), we get 8 treatment combinations: (1), a, b, c, ab, ac, bc and abc. The main effects and interactions, as defined before, can be represented by the following expressions which, on being expanded algebraically, give the effects in terms of the yields of the treatment combinations:

$$A = \frac{1}{4}(a - 1)(b + 1)(c + 1) = \frac{1}{4}[-(1) + (a) - (b) + (ab) - (c) + (ac) - (bc) + (abc)],$$
$$B = \frac{1}{4}(a + 1)(b - 1)(c + 1),$$
$$C = \frac{1}{4}(a + 1)(b + 1)(c - 1),$$
$$AB = \frac{1}{4}(a - 1)(b - 1)(c + 1),$$
$$AC = \frac{1}{4}(a - 1)(b + 1)(c - 1),$$
$$BC = \frac{1}{4}(a + 1)(b - 1)(c - 1), \text{ and}$$
$$ABC = \frac{1}{4}(a - 1)(b - 1)(c - 1).$$


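These expressions can be presented in the following tabular form:

Effect      Treatment Combinations (in standard form)             Divisor
            (1)   (a)   (b)   (ab)  (c)   (ac)  (bc)  (abc)
M            +     +     +     +     +     +     +     +             8
A            -     +     -     +     -     +     -     +             4
B            -     -     +     +     -     -     +     +             4
C            -     -     -     -     +     +     +     +             4
AB           +     -     -     +     +     -     -     +             4
AC           +     -     +     -     -     +     -     +             4
BC           +     +     -     -     -     -     +     +             4
ABC          -     +     +     -     +     -     -     +             4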
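The table of signs can also be generated mechanically, which is convenient for checking it or for extending it to more factors. The sketch below is one possible way of doing so; the helper names `standard_order` and `sign_table` are hypothetical. A sign is negative once for every letter of the effect that is absent from the treatment combination, which is what expanding a product such as $(a - 1)(b + 1)(c + 1)$ gives.

```python
def standard_order(factors):
    """Treatment combinations of a 2^n factorial in standard order.

    The empty string stands for the combination written (1) in the text.
    """
    combos = [""]
    for f in factors:
        combos += [c + f for c in combos]
    return combos


def sign_table(factors):
    """Return (combinations, table); table maps an effect to its +1/-1 signs."""
    combos = standard_order(factors)
    table = {}
    for effect in combos[1:]:          # skip "", which corresponds to the mean
        signs = []
        for combo in combos:
            sign = 1
            for letter in effect:
                sign *= 1 if letter in combo else -1
            signs.append(sign)
        table[effect.upper()] = signs
    return combos, table


combos, table = sign_table("abc")
print([c or "(1)" for c in combos])
for effect, signs in table.items():
    print(effect.ljust(4), " ".join("+" if s > 0 else "-" for s in signs))
```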



This whole thing can be extended to include n factors, all at 2 levels, by considering the expression
$$\frac{1}{2^{n-1}}(a \pm 1)(b \pm 1)(c \pm 1)\cdots,$$
where a minus sign appears in any factor on the right if the corresponding letter is present on the left.


23.7.4. Design and Analysis for Factorial Experiments. A factorial experiment is not considered as an experimental design because of the fact that the basic designs, namely, the CR design, the RCB design and the LS design, are used to carry out the factorial experiments. The point to make here is that the treatments have a factorial structure.

For the purpose of analysis of variance, the basic sums of squares are computed in the usual manner, with the addition that the treatment sum of squares is further partitioned into component parts of main effects and interactions. It is interesting to note that the best estimate of any effect is obtained from the contrasts shown in the preceding tables of plus and minus signs. The significance of all effects is tested against the error mean square. Sometimes we may not be interested in interactions; then we pool their sums of squares, assuming that they do not exist.


Let us consider a $2^2$-factorial experiment which has been carried out in a randomised complete block design with r replications. The statistical model is, as before, that of the randomised complete block design, and the sums of squares for the main effects and interaction, which are components of the treatment effect, are computed as below:

Defining the effect total by [ ], i.e.,
$$[A] = -[1] + [a] - [b] + [ab], \text{ etc.,}$$
the pertinent formulas are

$$\text{SS for main effect } A = \frac{(\text{Contrast})^2}{r\sum(\text{contrast coefficient})^2} = \frac{[A]^2}{4r},$$
$$\text{SS for main effect } B = \frac{[B]^2}{4r}, \text{ and}$$
$$\text{SS for interaction } AB = \frac{[AB]^2}{4r}.$$

Source of Variation     d.f.          Sum of Squares
Blocks                  r - 1
Treatments              3
   A                    1             [A]^2/4r
   B                    1             [B]^2/4r
   AB                   1             [AB]^2/4r
Error                   3(r - 1)      By subtraction
Total                   4r - 1
To test the appropriate hypotheses regarding main effects A and B, and the interaction AB, we calculate F-statistics with error mean square in the denominator. If the treatment MS does not prove to be significant, no question of examining the effects A, B or AB arises.

The sums of squares for effects may also be obtained by constructing a 2-way table for the factors a and b.

For the analysis of variance of a $2^3$-factorial experiment, the treatment SS would be partitioned into 7 components, A, B, C, AB, AC, BC and ABC, each associated with 1 degree of freedom. The sum of squares for any effect is computed by $[\text{effect}]^2/8r$, where r represents the number of replicates.

Example 23.7. A $2^2$-factorial experiment, i.e., with 2 varieties (factors) and 2 manures (levels), was carried out in a randomised complete block design with 3 replicates. The yields given in the following table are hypothetical.

[Table: yields for the treatment combinations (1), v, m and vm in replicates 1, 2 and 3. The individual entries are illegible in the source; the treatment totals used in the computations are [1] = 15, [v] = 24, [m] = 15 and [vm] = 27.]

Perform the analysis of variance and test the significance of varieties (factors) and manures (levels).

(i) We formulate our null hypothesis as

$H_0$: There is no difference between treatment combinations.

If it is rejected, then the other hypotheses are

(a) The variety effects are zero.

(b) The manure effects are zero.

(c) There is no interaction effect.

These are tested against the appropriate alternative hypotheses.

(ii) We choose the significance level at $\alpha = 0.05$.

(iii) The test-statistic to use is
$$F = \frac{\text{MS for the effect}}{\text{MS for error}},$$
which, if $H_0$ is true, has an F-distribution with appropriate degrees of freedom.


(iv) Computations. The sums of squares are calculated as below:

$$\text{C.F.} = \frac{(81)^2}{12} = 546.75$$
$$\text{Total SS} = \sum\sum y^2 - \text{C.F.} = 621.00 - 546.75 = 74.25$$
$$\text{Block SS} = \frac{\sum B^2}{k} - \text{C.F.}, \text{ where } k \text{ is the number of treatments,}$$
$$= \frac{2261}{4} - \text{C.F.} = 565.25 - 546.75 = 18.50$$
$$\text{Treatment SS} = \frac{\sum T^2}{r} - \text{C.F.}, \text{ where } r \text{ is the number of replicates,}$$
$$= \frac{1755}{3} - \text{C.F.} = 585.00 - 546.75 = 38.25$$
$$\text{Error SS} = 74.25 - (18.50 + 38.25) = 17.50$$

To compute the sums of squares for main effects and interaction, we first find the effect-totals as

[V] = [vm] + [v] - [m] - [1] = 27 + 24 - 15 - 15 = 21,
[M] = [vm] - [v] + [m] - [1] = 27 - 24 + 15 - 15 = 3,
[VM] = [vm] - [v] - [m] + [1] = 27 - 24 - 15 + 15 = 3.

Hence
$$\text{SS for Varieties} = \frac{[V]^2}{4r} = \frac{(21)^2}{12} = 36.75,$$
$$\text{SS for Manures} = \frac{[M]^2}{4r} = \frac{(3)^2}{12} = 0.75, \text{ and}$$
$$\text{SS for Interaction } VM = \frac{[VM]^2}{4r} = \frac{(3)^2}{12} = 0.75.$$

The sums of squares for Varieties, Manures and Interaction VM can also be computed by forming a 2-way table as below:

                 Varieties
Manures        v0      v1      Total
m0             15      24       39
m1             15      27       42
Total          30      51       81

$$\text{Total SS (for this table)} = \frac{(15)^2 + \cdots + (27)^2}{3} - \text{C.F.} = 38.25,$$
which is SS for treatment combinations. (It has been divided by 3 as each sum comes from 3 plots.)

$$\text{Variety SS} = \frac{(30)^2 + (51)^2}{6} - \text{C.F.} = 36.75$$
$$\text{Manure SS} = \frac{(39)^2 + (42)^2}{6} - \text{C.F.} = 0.75$$
$$\text{Interaction } VM = \text{Error SS in this table} = 38.25 - (36.75 + 0.75) = 0.75,$$
which are the same.
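The arithmetic of this example can be reproduced with a short Python sketch. It starts from the treatment totals and from the Total SS and Block SS quoted in the computations, because the individual plot yields are not available here; the function name `partition_2x2_factorial` is hypothetical.

```python
def partition_2x2_factorial(totals, r, total_ss, block_ss):
    """Partition the treatment SS of a 2^2 factorial run in an RCB design.

    totals   -- treatment totals keyed by "(1)", "v", "m", "vm"
    r        -- number of replicates (blocks)
    total_ss -- total sum of squares, taken as given
    block_ss -- block sum of squares, taken as given
    """
    one, v, m, vm = (totals[k] for k in ("(1)", "v", "m", "vm"))
    cf = (one + v + m + vm) ** 2 / (4 * r)            # correction factor
    treatment_ss = (one**2 + v**2 + m**2 + vm**2) / r - cf
    effect_totals = {
        "V": vm + v - m - one,
        "M": vm - v + m - one,
        "VM": vm - v - m + one,
    }
    effect_ss = {name: t**2 / (4 * r) for name, t in effect_totals.items()}
    error_ss = total_ss - block_ss - treatment_ss
    error_ms = error_ss / (3 * (r - 1))               # error d.f. = 3(r - 1)
    # every effect has 1 d.f., so its mean square equals its sum of squares
    f_ratios = {name: ss / error_ms for name, ss in effect_ss.items()}
    return treatment_ss, effect_ss, error_ss, f_ratios


treatment_ss, effect_ss, error_ss, f_ratios = partition_2x2_factorial(
    {"(1)": 15, "v": 24, "m": 15, "vm": 27}, r=3, total_ss=74.25, block_ss=18.50
)
print(treatment_ss, effect_ss, error_ss)
print({name: round(f, 2) for name, f in f_ratios.items()})
```

The three effect sums of squares add up to the treatment SS, which is a useful check on the partition.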


These results are summarized in the following ANOVA-table:

Source of Variation     d.f.    Sum of Squares    Mean Square    Computed F    Tabulated F (0.05)
Blocks                   2          18.50             9.25
Treatments               3          38.25            12.75          4.37            4.76
   Varieties (V)         1          36.75            36.75         12.60            5.99
   Manures (M)           1           0.75             0.75          0.26            5.99
   Interaction (VM)      1           0.75             0.75          0.26            5.99
Error                    6          17.50             2.92
Total                   11          74.25

(v) For critical regions, the table values of F at $\alpha = 0.05$ with appropriate degrees of freedom are given in the last column of the ANOVA-table.


23.7.5. Yates' Method. The effect totals of a $2^n$-factorial experiment can be computed systematically by the following procedure:

(i) Write the treatment combinations in standard order in a column, with the corresponding treatment totals beside them.

(ii) Add the treatment combinations (observations) in pairs and write down in column marked (1). This fills up the top half of the column. The bottom half is filled by subtracting, in each pair, the first value from the second.

(iii) Obtain column marked (2) in exactly the same way, using the results in column (1), i.e., operating on the results of the preceding column.

(iv) Continue this operation until the nth column for a $2^n$-factorial experiment is reached.

The column (n) corresponds to effects; the first value will be the total of the entire experiment. Each remaining value will be the effect total for the treatment combination in that order. The SS for Main Effects and Interactions are obtained by squaring each of the values given in column (n) and dividing the results by $r2^n$, where r is the number of replicates.

As an illustration, the procedure is outlined in the following table for a $2^2$-factorial experiment.

Yates' Method for a $2^2$-Factorial Experiment

Treatment
Combination      (1)            (2)                        Effect
(1)              (1) + a        (1) + a + b + ab           Total
a                b + ab         a - (1) + ab - b           A
b                a - (1)        b + ab - (1) - a           B
ab               ab - b         ab - b - a + (1)           AB

23.7.6. Advantages and Disadvantages. The advantages and disadvantages of a factorial experiment are stated as follows:

(i) A factorial experiment is usually economical.

(ii) All the experimental units are used in computing the main effects and interactions.

(iii) The use of all treatment combinations makes the experiment more efficient and comprehensive.

(iv) The interaction effects are easily estimated and tested through the usual analysis of variance.

(v) The experiment yields unbiased estimates of effects, which are of wider applicability.

(vi) A factorial experiment requires an excessive amount of experimentation when there are several factors at several levels. For example, a factorial arrangement of 8 factors, each at 2 levels, requires 256 combinations and the number of combinations in case of 7 factors, each at 3 levels, would be 2187. To overcome this difficulty, we use a device, known as fractional replication, where only certain properly chosen levels of factors are used.

(vii) A large number of combinations, when used, decreases the efficiency of the experiment. The experiment can be reduced to a manageable size by confounding some effects considered of little practical consequence.

(viii) The experimental set up and the resulting statistical analysis are more complex.


EXERCISES

23.1 (a) What is meant by an experimental design? Describe, in brief, the basic principles of experimental designs.
(P.U., B.A./B.Sc. 1985, 91)

(b) Define and discuss the use of Randomization and Replication in designing an experiment. (P.U., B.A./B.Sc. 1988)

23.2 Discuss the importance of Randomization, Replication and Local Control in designing an experiment. What effects do they have on validity of conclusions and inherent errors of experiment?
(P.U., M.A. 1961)

23.3 (a) Discuss the purpose of replication in the experimental design.

(b) Show mathematically how variance is analysed into two or three independent parts corresponding to recognised sources of variation of data.

23.4 (a) Describe a Completely Randomized design, its model and analysis. What are its advantages and disadvantages?
(P.U., B.A./B.Sc. 1992)


(b) The following table contains the body weights of calves at 8 
of feeding given to a 
each. The completely 
randomized design was used. Obtain the standard error of a 
feeding treatment mean for the data on body weights. 


134 168 
126 182 183 
109 135 172 


pret! 13 15 14 14 17 45 16 “3065 


11 11 10 10 45 9 12
10 18 12 15 44 13 15
16 18 13 17 = jg l4 15
12 12 11 10 12 yy 4

23.6 In order to study the effect of storage condition on the moisture content of white pine lumber, five storage methods were investigated, with varying numbers of experimental units (sample boards) being stored under each condition. The data (observations in %) thus obtained are given below:

[Table: moisture content (%) under each Storage Condition; the entries are illegible in the source.]

State your null hypothesis, both in words and symbolically, about the storage conditions. Compute the analysis of variance and state your conclusions.

23.7 The staff of a university newspaper is experimenting with different formats. A completely randomized experiment is designed to compare three different formats. Volunteers in a journalism class are given one of three formats and, after reading the news stories, take a reading comprehension test. This gives the following statistics:


90 4.37 
86 3.76 
11 8H 4.21 


— 


Perform the analysis of variance, calculate the F-statistic, and determine the significance level for the hypothesis that the mean reading comprehension is the same for all formats.

23.8 (a) Describe a Randomized Complete Block design, its model and analysis. What are its advantages and disadvantages?

(b) The analysis of variance for a RCB design produced the ANOVA table shown below:

Treatments
Blocks
Error

(i) Complete the ANOVA table.

(ii) Do the data provide sufficient evidence to indicate a difference among the treatment means? Test using $\alpha = 0.01$.

(iii) If the sample means for treatments A and B are $\bar{y}_A$ = 94 and $\bar{y}_B$ = 12.1 respectively, find a 90% confidence interval for $(\mu_A - \mu_B)$.

23.9 (a) Compare Randomized Complete Block experiments with 
Completely Randomized experiments, comparing their 
respective advantages and relative efficiency, with 
illustrations. (P.U., B.A/B.Sc, 1986) 

(b) Three varieties A, B and C of a crop are tested in a randomized block design with four replications, the layout being given in the diagram appended. The plot yields in pounds are also indicated therein. Analyse the experimental yields and state your conclusions.


Replications 


(P.U., B.A./B.Sc. 1984)

23.10 In a randomized complete block design, in each of four blocks I, II, III, IV, four varieties of wheat A, B, C, D are grown in the layout given below and the yields are also indicated therein:

(a) Perform the analysis of variance to test, at 0.05 significance level, the differences in the yields of varieties and in blocks.

(b) What would have been the result if no blocking had been done, i.e., if we consider it as a completely randomized design?
(P.U., B.A./B.Sc. 1986, 96)
hardening steel, 
P of steel was cut 
random, forming 


` six com atments and data 


are shown below: 


Cc C B 

713 814 | 759 
C 
437 


(a) Perform a two-way analysis of variance and test the significance of treatments.

(b) What would have been the result if no blocking had been done?


23.12 (a) In a randomized block experiment b blocks are divided into p plots each, and p treatments are applied to one of the plots in each block. If $y_{ij}$ be the observed value of the jth treatment in the ith block, set up an appropriate analysis of variance table.

(b) For the data given below, test for the significance between the treatment means.

23.13 Four different kinds of fertilizers $f_1$, $f_2$, $f_3$ and $f_4$ are used to study the yield of beans. The soil is divided into three blocks each containing four homogeneous plots. The yields in pounds per acre and the corresponding treatments are as follows:

Block 1      Block 2      Block 3

B 
598 


(a) Conduct an analysis of variance using the randomized complete block model.

(b) Make the following comparisons: (i) [illegible in the source]; (ii) $f_1$ versus $f_3$.

23.14 Four treatments are given in four blocks in a Randomized Block Design as tabled below:

[Table: entries illegible in the source; surviving values: 904, 1840, 1564.]

Correction Factor = 1642883
Total Sum of Squares = 425314

Perform the analysis of variance and find out the F-ratio for treatments.
(P.U., B.A./B.Sc. 1980)

23.15 Given the following abbreviated analysis of variance for a randomized complete block design:

[Table with rows Blocks, Treatments and Error; entries illegible in the source; surviving values: 9, 3, 7, 2.]

(a) Complete the analysis; fill in the mean squares.

(b) Compute the standard error for a treatment mean and for the difference between 2 treatment means.

(c) The treatment means are 1.464, 1.195, 1.325 and 1.662. Which mean or means do you suspect might represent different population?

(d) Estimate the efficiency of this design relative to completely randomized design. (P.U., B.A./B.Sc.)

23.16 Obtain an expression for estimating a missing value in a randomized block experiment (when a single plot is missing). Explain how you will analyse the reconstructed data.

23.17 (a) Show that
$$x = \frac{kT + rB - G}{(k - 1)(r - 1)},$$
where T and B are the total yields for the treatment and block with the missing plot, and G is the grand total of observed values in the (r x k) randomized block design.

(b) Determine the missing observation x and compute the analysis of variance.

23.18 Estimate the missing observations in the following randomized block experiment so as to perform the analysis of variance.

Treatments
Blocks

16.2 141 130 136
11.7 Yə 12.9 yų 169 125
15.4 166 155 20.3 184 216
Ya 18.6 12.7 15.7 165 18.0

23.19 (a) Explain how analysis of covariance can be used to estimate a missing observation in experimental design.

(b) Apply the covariance technique to estimate the missing value, m, from the following data. Also obtain the treatment mean square and the error mean square.


23.20 (a) Describe a Latin Square Design and its analysis. What are the advantages and disadvantages of a Latin Square design?
(P.U., B.A./B.Sc.)

(b) Prepare a layout for 6 varieties of wheat in a Latin square design.

23.21 (a) Define a Latin square design. Explain the difference between a Latin square design and a randomized complete block design.
(P.U., B.A./B.Sc.)

(b) Why is a Latin square design sometimes referred to as double blocking?

(c) What restrictions on randomness are involved in a Latin square design?

23.22 (a) What is a reduced Latin square?

(b) Carry out the analysis of variance for the following Latin square.


V3 (3.0) 
V3 (4.1) 
V, (2.5) 
V, (2.0) 


V, (2.5) 
V, (2.4) 
V3 (2.9) 
Va (4.4) 
(P.U., B.A./B.Se. 1919) 


V3 (8.1) V, (2.4) 


V, (2.1) 


23.23 The atmosphere in 4 different districts of a large town was sampled, the samples being taken at 4 different heights. Four different tests for the presence of a certain chemical were made on the samples. The arrangement is shown in the following table with the percentage by weight of the chemical as determined by the tests. Letters denote the different tests.

Heights

Is there evidence of significant variation from district to district and between heights in the percentage of the chemical present in the atmosphere?


23.24 An experiment was conducted to assess the relative resistances to abrasion of four grades of leather (A, B, C and D). A machine was used in which the samples could be tested in any one of 4 positions. Since different runs (replications) are known to yield variable results, it was decided to make four runs. A Latin square design was used and the following results were obtained.

                         Positions
Runs        1            2            3            4
1        (A) 150      (B) 145      (D) 130      (C) 133
2        (D) 130      (C) 172      (A) 170      (B) 127
3        (C) 133      (D) 132      (B) 115      (A) 170
4        (B) 98       (A) 171      (C) 132      (D) 120

Perform the analysis of variance and test the significance of the grades of leather.

23.25 Four machines were tested in a 4x4 Latin Square design to see if they should be adjusted to produce more uniform product. If only one adjustment can be made, which machine should be adjusted and by how much? Justify your recommendations by completing the following analysis.

Source of Variation          Mean Square
Rows (Operators)
Columns (Time Periods)
Machines (Treatments)
Error

[Entries illegible in the source; surviving values: 73, 1, 0.00, 44.75, 4, 43.00.]

(P.U., B.A./B.Sc. Hons. Part II, 1964)

23.26 The following is a 5x5 Latin Square for the data taken from a manurial experiment with sugarcane. The five treatments were as follows:

A: no manure; B: an inorganic manure; C, D, E: three levels of farm-yard manure.

Plan and Yield of Sugarcane (in suitable units) per plot are as follows.

Columns

Analyse the above data to find out if there are any differences among the treatment effects at 5% level of significance.
(P.U., B.A./B.Sc. 1985)


23.27 An agricultural experiment was conducted on the Latin square plan to test the effect on yield due to change of treatment (5 kinds) and also to variation of soil in each of two perpendicular directions. The results are set out in the Latin square below, in which letters correspond to treatments, while rows and columns correspond to the two perpendicular directions.

(a) Perform an analysis of variance and find out if the effects on yield are significant.

(b) If no blocking had been done in rows and columns, what would have been the conclusion?

23.28 The responses of five monkeys to a stimulus under five different conditions during five periods consisting of successive weeks were observed according to the Latin Square design below. The numbers are the total number of responses and letters denote the conditions. Analyse the data, describing briefly any assumptions that you make.

Periods


Before the experiment was carried out, it was thought that condition E would produce a greater response on average than condition D. Do the data support this?

23.29 We wish to conduct a field experiment to test the yielding ability of 6 varieties of wheat and have available an area of land sufficient for 36 plots. Give a layout and indicate the proper partitioning of the total degrees of freedom for the following experimental designs:

(a) Completely randomized,

(b) Randomized complete block,

(c) Latin square.

Indicate, by means of arrows, the proper F-tests for testing variety differences in each design. (P.U., B.A./B.Sc. 1975, '87)

23.30 Describe Latin Square design and its analysis. How does it differ from a Randomized Complete Block design? How will you estimate missing values in a Latin Square design? Estimate the relative efficiency of the two designs mentioned above.
(P.U., M.A. Stat., 1960)


23.31 (a) Given a Latin Square design with r rows, columns and treatments. Assume that the data for one plot were lost. Show that the estimate of missing observation found by minimising error sum of squares is
$$x = \frac{r(R + C + T) - 2G}{(r - 1)(r - 2)},$$
where R, C and T are respectively the totals of row, column and treatment which contain the missing observation; and G is the grand total.

(b) Supposing observation for treatment D in row [...] and column 2 of question 23.27 were missing, estimate it for the purpose of analysis of variance.

23.32 (a) What is meant by a complete set of orthogonal Latin squares, and in what circumstances might such a set be used? Construct a complete set of orthogonal Latin squares of side 4.

(b) Indicate the main stages in the analysis of variance for a Graeco-Latin square. (P.U., B.Sc. Hons. Part II, 1972)


23.33 Construct a 5x5 Graeco-Latin Square experimental design, indicating the steps in the construction, and give the analysis of variance appropriate to this design. Comment on the assumptions necessary for the analysis and on the use of the design.

23.34 A composite measure of screen quality was made on screens using four lacquer concentrations, four standing times, four [...] concentrations (A, B, C, D) and four acetone concentrations (α, β, γ, δ). A Graeco-Latin square design was used and the data were recorded as follows:

Standing               Lacquer Concentrations
Time           1            2            3            4
1           Cβ (16)      Bγ (12)      Dδ (17)      Aα (11)
2           Bα (15)      Cδ (14)      Aγ (15)      Dβ (14)
3           Aδ (12)      Dα (6)       Bβ (14)      Cγ (13)
4           Dγ (9)       Aβ (9)       Cα (8)       Bδ (9)

Do a complete analysis of these data.

23.35 What are factorial experiments? What are their advantages over single factor experiments? Explain the terms Main Effects and Interaction, and discuss how you will separate the degrees of freedom due to main effects and interaction in a 2x2 factorial experiment.

23.36 Discuss the chief properties of a factorial experiment. Explain the meanings of main effects and interaction effects in a factorial experiment. Give a plan for analysis of data of such an experiment. (P.U., B.A./B.Sc. Hons. Part I)

23.37 The data of the following table are from a 2x2 factorial experiment. Partition the treatment sum of squares into main effects and interaction component. Interpret the results.

[Table by replication; entries illegible in the source; surviving values: 6, 21, 14, 8, 20, 17, 7.]

23.38 [The statement and data of this exercise are illegible in the source.] Perform a complete analysis of the data and give your interpretation of the results.

23.39 Eight treatments have been applied to a certain crop and their effects have been reflected in the following table with two replications:

Give a method of analysis and interpret your result. Discuss an estimate of error in this problem.
(P.U., B.A./B.Sc. Hons. Part II, 1963)

23.40 An experiment was laid to see the effect of three fertilizers n, p and k at two levels each (i.e., applied and not applied). The eight combinations were randomized in each replication. Set up an analysis of variance to test the significance of Main Effects and Interactions from the data given below:

Treatments

(P.U., B.A./B.Sc. Hons.)


23.41 The following data were obtained from a $2^3$-factorial experiment:

19 20 16 17 25 19 93 9
5
10 16 17 27 21 19 29 5

Evaluate the sum of squares for all factorial effects by (i) contrast method, (ii) Yates' method.

23.42 (a) Give the structure of the analysis of covariance for a Latin square design.

(b) Set up analysis of covariance table for the following data and obtain the adjusted treatment means.

(D) 80 45 | (B) 75 44
(D) 74 46
(A) 75 41

(C) 85 46

(C) 71 44
(A) 100 47
(D) 84 46
(B) 73 45

(A) 88 46
(B) 74 45
(C) 73 47
(D) 86 46

(C) 84 48
(B) 75 47
(A) 86 48

24 


Nonparametric Tests 


24.1 INTRODUCTION 


Most of the hypothesis testing procedures presented in earlier chapters are based on certain restrictive assumptions about the distribution and parameters of the population(s) from which the samples are drawn. The most common assumption is that of normality. As these tests are based on the assumptions that the sampled populations are approximately normal with equal variances, they are called parametric tests. However, there are many situations, particularly in the social sciences, where the normality assumption is in fact not met and the parametric tests cannot be carried out. In such instances, the techniques that can be used are called nonparametric tests.


A nonparametric test is one that is performed without making any restrictive assumptions about the form of the population distribution and its parameters. A nonparametric test is also called a distribution-free test because it makes no specific assumptions about the shape or form of the underlying population distribution except that the population distribution is continuous. The nonparametric tests, which provide useful alternatives to the parametric tests, have several advantages, some of which are given as follows:

(i) The nonparametric tests require no assumptions about the form or parameters of the underlying populations. In other words, they do not require that samples be drawn from normal populations with specified parameters.

(ii) The nonparametric tests are relatively easy to understand.

(iii) The nonparametric tests are computationally simple and relatively quick to apply as compared with the parametric tests.

(iv) The nonparametric tests may be applied to qualitative data or data that can be ranked but not measured exactly.

However, the nonparametric tests are generally less efficient as they do not use all the information available in the sample. Another disadvantage is that they require a larger sample size than the corresponding parametric tests.

There are several nonparametric tests, including the Chi-square tests of goodness-of-fit and of independence in contingency tables already discussed in Chapter 17. The rank correlation coefficient, which measures the strength of relationship between two sets of ranked data, is also a nonparametric alternative to the simple coefficient of correlation.


24.2 THE SIGN TEST

The sign test is perhaps the simplest and the oldest nonparametric test in use. As its name implies, it is based on the signs (pluses and minuses) of the observed differences. It is used to test the null hypothesis that the probability of a + sign equals the probability of a − sign, which is equivalent to testing the hypothesis, in case of one sample, that the population median M has a specified value, say M₀, because each observation is equally likely to be above the median as below it; and in case of two samples, that the two populations are identical (which implies that the two populations have the same distributions and share the same mean and variance).


(i) One Sample. To perform the test, we replace each observation by a plus sign or a minus sign depending upon whether the observation is above or below M₀, the hypothesized value of the population median. We discard any observation that equals M₀ and reduce the sample size accordingly. We denote the total number of plus and minus signs by n. The test statistic X is defined as the number of times the less frequent sign (plus or minus) occurs. Under our null hypothesis, the sampling distribution of X is binomial with parameters p = 1/2 and n. We determine the critical region by calculating the binomial probabilities. To reach the significance level α, we add the probabilities from both tails in case of a two-tailed test, and in case of a one-tailed test, the probabilities in the desired tail are added to reach α. We accept or reject the hypothesis in the usual manner. The null hypothesis, in case the populations are symmetric, can be stated as H₀: μ = μ₀.


(ii) Two Samples. Let Xᵢ and Yᵢ denote the observations from the first sample and the second sample respectively. We replace each difference Xᵢ − Yᵢ by a plus sign if Xᵢ > Yᵢ, by a minus sign if Xᵢ < Yᵢ, and discard the pair if Xᵢ = Yᵢ, i.e., zero differences are dropped from the analysis. Let n represent the number of plus and minus signs and X the number of times the less frequent sign (plus or minus) occurs. Then the sampling distribution of X is binomial with parameters p = 1/2 and n. The rest of the procedure is the same as in the one-sample sign test. If the sample sizes are not equal, some of the values of the larger sample are to be discarded (the data must be from matched pairs). The null hypothesis may be stated as H₀: μ₁ = μ₂ when the underlying populations are assumed to be symmetric.

With large n, we use the normal approximation to the binomial distribution b(n, 1/2). The statistic X is then approximately normal with mean n/2 and standard deviation √(n/4). In other words, the test-statistic under H₀ becomes

$$Z = \frac{X - n/2}{\sqrt{n/4}}, \quad \text{without correction for continuity,}$$

$$Z = \frac{\left(X + \tfrac{1}{2}\right) - n/2}{\sqrt{n/4}}, \quad \text{with correction for continuity.}$$

We reject or accept H₀, applying the usual decision rules. In applying the normal approximation to the binomial distribution, n is taken as large when both np and nq are at least 5. As p = 1/2, we can therefore use the normal approximation when n exceeds 10.
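The exact and the large-sample versions of the sign test described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sign_test` and the counts of 8 plus signs and 3 minus signs are assumptions made for the illustration.

```python
from math import comb, sqrt

def sign_test(n_plus, n_minus):
    """Return (n, X, exact lower-tail probability, continuity-corrected Z).

    n_plus and n_minus are the counts of plus and minus signs after
    zero differences have been discarded.
    """
    n = n_plus + n_minus               # number of non-zero differences
    x = min(n_plus, n_minus)           # X: count of the less frequent sign
    # Exact binomial tail P(X <= x) with p = 1/2
    tail = sum(comb(n, k) for k in range(x + 1)) / 2 ** n
    # Normal approximation with continuity correction (meant for n > 10)
    z = (x + 0.5 - n / 2) / sqrt(n / 4)
    return n, x, tail, z

# Illustrative counts (assumed): 8 plus signs and 3 minus signs
n, x, tail, z = sign_test(8, 3)
print(n, x, round(tail, 3), round(z, 2))
```

For a one-tailed test the probability `tail` is compared with α directly; for a two-tailed test it is compared with α/2.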


Another appropriate statistic to test the hypothesis H₀: P[+ sign] = P[− sign] = 1/2 is the chi-square statistic, given by

$$\chi^2 = \frac{(n_1 - n_2)^2}{n_1 + n_2} \quad \text{with 1 d.f.,}$$

where n₁ and n₂ represent the number of plus and minus signs respectively. The sign test is simple and easy to apply. It is used in situations where the Student's t-test is not applicable. When the t-test is applicable, it is more efficient and more powerful than the sign test. Another disadvantage of the sign test is that it considers only the signs of the differences and not their magnitude.
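As a small sketch of the chi-square form of the sign test, the statistic can be computed directly from the two sign counts. The function name and the counts used (8 plus signs, 3 minus signs) are assumptions for illustration; 3.84 is the usual 5% critical value of chi-square with 1 d.f.

```python
def sign_test_chi_square(n_plus, n_minus):
    """Chi-square statistic (1 d.f.) for H0: P[+ sign] = P[- sign] = 1/2."""
    return (n_plus - n_minus) ** 2 / (n_plus + n_minus)

# Illustrative counts (assumed): 8 plus signs and 3 minus signs
chi2 = sign_test_chi_square(8, 3)
reject = chi2 > 3.84   # 5% critical value of chi-square with 1 d.f.
print(round(chi2, 2), reject)
```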

The procedure for testing the hypothesis that the population median has a specified value M₀, in case of one sample, is given below:

(i) Formulate the null and alternative hypotheses as

H₀: Population Median M = M₀


H₁: Population Median M ≠ M₀ (or M > M₀ or M < M₀).

(ii) Decide on significance level α.

(iii) The test-statistic is X, the number of times the less frequent sign (+ or −) occurs, and is binomially distributed. If n, the number of pluses and minuses, exceeds 10, the test-statistic would become

$$Z = \frac{X - n/2}{\sqrt{n/4}}$$

which, if H₀ is true, is approximately standard normal.

(iv) Computations. Subtract M₀, the hypothesized value of the population median, from each observation of the sample, i.e. find the differences Xᵢ − M₀. Write down a plus sign if the difference is positive and a minus sign if the difference is negative. Ignore zero differences, if any. Denote by n the number of plus and minus signs (i.e., non-zero differences) and by X the number of times the less frequent sign occurs. Compute either the extreme probabilities of the binomial variable X or the value of Z, as the case may be.

(v) The critical region depends on the test-statistic, alternative hypothesis and significance level α.

(vi) Apply the usual decision rule to reject or accept the null hypothesis.

The procedure in the case of the two-sample sign test would be the same except for the following two steps:

(i) H₀: The two populations are identical or that they have equal medians, M₁ = M₂. It is tested against an appropriate alternative hypothesis.

(ii) Computations. Subtract each observation of the second sample, say Yᵢ, from the corresponding observation of the first sample, say Xᵢ, i.e. find the differences Xᵢ − Yᵢ. Write a plus sign if the difference is positive and a minus sign if the difference is negative. Discard zero differences, if any; and so on.

Example 24.1. Use the sign test to test the hypothesis that the median of the population from which the following data is a sample, equals 30 against the alternative that it does not: 27, 39, 30, 22, 32, 24, 25, 29, 26.

(i) We set up our hypotheses as H₀: M = 30 and H₁: M ≠ 30.

(ii) We specify the significance level at α = 0.05.

(iii) The test-statistic is X, the number of times the less frequent sign occurs.

(iv) Computations. Subtracting 30, the hypothesized value, from each observation and writing down the signs, we get

− + 0 − + − − − −

Now n = 8 as zero is ignored and X = 2, the number of plus signs (less frequent). Under H₀, the sampling distribution of X is binomial with p = 1/2 and n = 8, so that

$$P(X \le 2) = \left[\binom{8}{0} + \binom{8}{1} + \binom{8}{2}\right]\left(\tfrac{1}{2}\right)^{8} = \frac{37}{256} = 0.145$$

(v) Critical region. For a two-tailed test, to reject H₀, the computed probability should be less than 0.025.

(vi) Conclusion. Since the computed probability is more than α/2 = 0.025, we therefore accept H₀ and conclude that the median of the population equals 30.

Example 24.2. An experimenter wants to determine the effectiveness of a certain reducing diet. Twelve persons were put on the diet; their weights before and after they tried the diet are shown below:

Person        |   1    2    3    4    5    6    7    8    9   10   11   12
Weight before | 202  154  183  180  228  164  139  165  175  245  237  162
Weight after  | 195  154  178  199  220  157  135  180  198  206  227  155

Use the sign test, at the 5% significance level, to test the hypothesis that the diet is not effective against the alternative that it is effective.

(i) The null and the alternative hypotheses are

H₀: The diet is not effective, which is equivalent to testing the hypothesis H₀: P[+ sign] = P[− sign] = 1/2, and

H₁: The diet is effective, i.e. p > 1/2 (one-tailed test).

(ii) The significance level is set at α = 0.05.

(iii) The test-statistic to be used is X, the number of times the less frequent sign occurs. Under H₀, X is binomially distributed.


(iv) Computations. Subtracting the weights after from the weights before they tried the diet, and writing a plus sign for each positive difference and a minus sign for each negative difference, we get

+ 0 + − + + + − − + + +

Thus n = 11, the number of plus and minus signs, ignoring the zero difference, and X = 3, the number of minus signs (less frequent signs). The statistic X is the number of minus signs, therefore

$$P(X \le 3) = \left(\tfrac{1}{2}\right)^{11} + 11\left(\tfrac{1}{2}\right)^{11} + 55\left(\tfrac{1}{2}\right)^{11} + 165\left(\tfrac{1}{2}\right)^{11} = \frac{232}{2048} = 0.113$$

(v) Critical region. For a one-tailed test, the computed probability is required to be less than 0.05.

(vi) Conclusion. Since the computed probability is more than 0.05, we do not reject H₀.

Alternative Method (a). Since n, the number of plus and minus signs, exceeds 10, the sign test can also be carried out using the normal approximation to the binomial distribution. With μ = np = 5.5 and σ = √(npq) = √(11 × 1/2 × 1/2) = 1.66, we find, correcting for continuity,

$$Z = \frac{\left(X + \tfrac{1}{2}\right) - n/2}{\sqrt{n/4}} = \frac{3.5 - 5.5}{1.66} = -1.20,$$

which is smaller in absolute value than z₀.₉₅ = 1.645. Hence we cannot reject H₀, the same conclusion as obtained by applying the binomial distribution.

(b) The sign test in this case can also be carried out by the method of chi-square. Thus we have

$$\chi^2 = \frac{(n_1 - n_2)^2}{n_1 + n_2},$$

where n₁ and n₂ denote the number of plus and minus signs. Substitution gives

$$\chi^2 = \frac{(8 - 3)^2}{8 + 3} = \frac{25}{11} = 2.27$$

This value does not exceed χ²₀.₀₅₍₁₎ = 3.84 and therefore the null hypothesis cannot be rejected.

24.3 WILCOXON SIGNED-RANK TEST FOR THE PAIRED OBSERVATIONS

The Wilcoxon signed-rank test was proposed by Frank Wilcoxon in 1945 and is named after him. This test is applied to paired differences when the assumption of normality is suspect. The test is more powerful than the sign test because it makes use of the magnitudes of the differences between paired comparisons. The null hypothesis is that the two populations from which the respective members of the matched pairs come, are identical. To perform the test, we rank the absolute values of the differences between the pairs in increasing order, discarding zero differences, if any, and in the case of ties, assigning to each of such differences the average of the ranks that would have been assigned, had they differed slightly. We then compute the sum of the ranks assigned to the positive differences and the sum of the ranks assigned to the negative differences. The test statistic, denoted by T, is based on the smaller sum of ranks with the sign ignored. If H₀ is true, we expect the two sums to be about equal. A large difference between the two sums provides evidence against the null hypothesis. We reject H₀ if the computed value of T is less than or equal to the critical value at the specified level of significance, given in a table titled "Critical values for the Wilcoxon Signed-Rank Test" (Table 24.1). It is to be noted that in Z, χ², t and F tests, it is the larger values that generally provide evidence against the null hypothesis. The Wilcoxon signed-rank test is also applicable to test the hypothesis that the median of a population of differences equals some specified value, M₀, i.e., H₀: [Median of (X − Y)] − M₀ = 0.

With large n (say n > 25), it has been shown that the signed-rank statistic T is approximately normally distributed with a mean

$$\mu_T = \frac{n(n+1)}{4}$$

and a standard deviation

$$\sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}},$$

where n is the number of matched pairs, excluding zero differences. When n > 25, we generally use the test-statistic

$$Z = \frac{T - \mu_T}{\sigma_T} = \frac{T - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}}$$


which, if H₀ is true, is approximately standard normal. We accept or reject H₀, applying the usual decision rules.

The test-procedure is given below:

(i) Formulate the null and alternative hypotheses as

H₀: The two populations are identical, which is equivalent to testing the hypothesis that the medians of the two populations are equal, or the two means are equal in the case of symmetric populations; and

H₁: The two populations are not identical, or the two medians are not equal, or one median is larger than the other, etc.

(ii) Decide on significance level α.

(iii) The test-statistic is T, the smaller sum of ranks with the sign ignored.

(iv) Computations. Find the magnitude of the differences between the paired values. Arrange the non-zero differences in order of increasing absolute values (i.e., disregarding the algebraic sign). Assign the ranks 1, 2, 3, etc. to the ranked differences. If ties occur, then assign to each such difference the average of the ranks that would have been assigned if they had differed slightly. Assign to each rank the sign of the original difference. Find the sum of the positive ranks and the negative ranks separately. Take the smaller sum of the ranks with the sign ignored as T. (In case of large n, calculate the value of the Z-statistic).

(v) The critical value of T is given in the Wilcoxon signed-rank table (Table 24.1) for the value of n (number of matched pairs excluding zero differences) at the chosen level of significance α. This is the value at or below which lies the critical region.

(vi) Decide as under:

Reject H₀ if the observed value of T ≤ the critical value, otherwise accept it. (In case of the Z-statistic, reject or accept H₀ applying the usual decision rule).

The Wilcoxon Signed-Rank test is also applicable when we are given a random sample and we wish to test the null hypothesis that the population median is equal to some specified value, M₀. In this case, we match every observation in the sample with the hypothesized value M₀ and find the differences X − M₀. If the null hypothesis is true, the sum of the signed ranks should be close to zero. The rest of the procedure is the same as specified in the case of the paired differences.

Table 24.1. Critical Values of T for the Wilcoxon Signed-Rank Test

Two-tailed test:  α=0.10   α=0.05   α=0.02   α=0.01
One-tailed test:  α=0.05   α=0.025  α=0.01   α=0.005
  n
  5                 0        -        -        -
  6                 2        0        -        -
  7                 3        2        0        -
  8                 5        3        1        0
  9                 8        5        3        1
 10                10        8        5        3
 11                13       10        7        5
 12                17       13        9        7
 13                21       17       12        9
 14                25       21       15       12
 15                30       25       19       15
 16                35       29       23       19
 17                41       34       27       23
 18                47       40       32       27
 19                53       46       37       32
 20                60       52       43       37
 21                67       58       49       42
 22                75       65       55       48
 23                83       73       62       54
 24                91       81       69       61
 25               100       89       76       68

Reject H₀ if the calculated value of T ≤ table value at the chosen value of α. n is the number of non-zero matched pairs.

Example 24.3. Ten young recruits were put through a strenuous physical training programme by the Army. Their weights were recorded before and after the training with the following results:

[Table of weights before and after training: values not legible in the source.]

Use the Wilcoxon signed-rank statistic to test the hypothesis that the programme affects average weights of recruits. Let α = 0.05.

(i) We state our null and alternative hypotheses as

H₀: The programme does not affect the average weight of recruits, i.e. μ₁ − μ₂ = 0;

H₁: The programme affects the average weight of recruits, i.e. μ₁ − μ₂ ≠ 0.
vu! 


(ii) The significance level is set at α = 0.05.

(iii) The test-statistic is T, the smaller sum of ranks with the sign ignored.

(iv) Computations are shown below:

[Computation table with columns Weight before (X), Weight after (Y), differences and signed ranks: values not legible in the source.]

(v) Looking in the table for the Wilcoxon signed-rank test (Table 24.1), we find that for n = 10 at α = 0.05 for a two-tailed test, the critical value of T is 8, at or below which lies the critical region.

(vi) Conclusion. Since the observed value of T = 10 does not lie in the critical region, we cannot reject our null hypothesis. The data do not provide sufficient evidence to indicate that the programme affects average weight.

Example 24.4. Given that the nine observations below are a random sample from a continuous, symmetric population, use the Wilcoxon signed-rank test, at the 5% level, to test the null hypothesis that the median equals 30 against the alternative that it is less:

27, 39, 30, 22, 32, 24, 25, 29 and 26.

(i) We state our null and alternative hypotheses as

H₀: Population Median, M = 30 and H₁: M < 30.

(ii) The significance level is set at α = 0.05.

(iii) The test-statistic to be used is T, the smaller sum of ranks with the sign ignored.

(iv) Computations are shown below:

  X     X − 30    |X − 30|    Rank    Signed Rank
 27      −3          3          3         −3
 39       9          9          8         +8
 30       0          -          -          -
 22      −8          8          7         −7
 32       2          2          2         +2
 24      −6          6          6         −6
 25      −5          5          5         −5
 29      −1          1          1         −1
 26      −4          4          4         −4

Sum of positive ranks = 10, sum of negative ranks = 26, so that T = 10.

(v) Looking in the table for the Wilcoxon signed-rank test, we find that for n = 8 (zero difference is ignored) at α = 0.05 for a one-tailed test, the critical value of T is 5, at or below which lies the critical region.

(vi) Conclusion. Since the computed value of T = 10 is larger than the critical value, the null hypothesis that the population median equals 30 cannot be rejected.
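The ranking steps of the signed-rank procedure can be checked with a short Python sketch. The helper names `average_ranks` and `wilcoxon_signed_rank` are made up for this illustration; the data are the nine observations of Example 24.4 with hypothesized median 30.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(sample, m0):
    """Return (n, sum of positive ranks, sum of negative ranks, T)."""
    diffs = [x - m0 for x in sample if x != m0]    # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return len(diffs), pos, neg, min(pos, neg)

data = [27, 39, 30, 22, 32, 24, 25, 29, 26]
n, pos, neg, T = wilcoxon_signed_rank(data, 30)
print(n, pos, neg, T)
```

The value of T is then compared with the tabulated critical value for the resulting n, as in step (v) of the procedure.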


24.4 THE WILCOXON RANK-SUM TEST FOR INDEPENDENT SAMPLES

The Wilcoxon signed-rank test for matched pairs cannot be performed when either the two samples are independent or the two samples have different sizes. For either of these situations, another nonparametric test, known as the Wilcoxon rank-sum test, is used. The null hypothesis to be tested is that the two samples come from identical populations or that the medians of the two populations are equal, against a suitable alternative. The test-statistic is denoted by R, the sum of the ranks assigned to the observations of the smaller sample. To perform the test, we combine the observations of the two independent samples of size n₁ and n₂ (n₁ ≤ n₂) from two populations that are continuous and symmetric. We arrange the n₁ + n₂ observations in order of increasing magnitude and assign the ranks 1, 2, ..., n₁ + n₂. In the case of ties, we assign the average of the ranks to the tied observations. The sum of the ranks of the smaller sample gives the statistic R. At the 5% level of significance, we reject H₀ if the computed value of R is less than or equal to the smaller table value or greater than or equal to the larger table value for the given n₁ and n₂.

It has been shown that when either sample size (or both) exceeds 10, the statistic R is approximately normally distributed with mean

$$\mu_R = \frac{n_1(n_1 + n_2 + 1)}{2}$$

and standard deviation

$$\sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

The test-statistic would then become

$$Z = \frac{R - \mu_R}{\sigma_R} = \frac{R - \dfrac{n_1(n_1 + n_2 + 1)}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$

which, if H₀ is true, is approximately standard normal. We accept or reject H₀, applying the usual decision rules. It has been further shown that, if the two populations are normal, the Wilcoxon rank-sum test is almost as efficient as the two-sample t-test.


Table 24.2. Critical Values of R for the Wilcoxon Rank-Sum Test

In this table there are four numbers for each value of n₁ and n₂. The top pair are the critical values for a two-tailed test for α = 0.05, while the lower pair are the critical values for a one-tailed test for α = 0.05.


2G 40 | 21 44 | 23 4 
28 50 | 29 55 
ae 
39 66 


The test-procedure is given below:

(i) Formulate the null and alternative hypotheses as

H₀: The two samples come from identical populations, or the medians of the two populations are equal;

H₁: The two samples do not come from identical populations, etc. In the case of a one-tailed test, we may state that M₁ > M₂, etc.

(ii) Decide on a significance level α (generally α = 0.05).

(iii) The test-statistic to be used is R, the sum of the ranks of the smaller sample. In case of equal sample sizes, either rank sum can be used.

(iv) Computations. Combine the two samples and arrange the observations in increasing order of magnitude. For convenience, underline the observations from sample 1. Assign the ranks 1, 2, ..., n₁ + n₂, with ties being assigned the average of the ranks that they occupy. Find R by adding together the ranks of the underlined observations. (When n₁ and n₂ exceed 10, calculate the value of the Z statistic).

(v) Critical values of R (for α = 0.05) are looked up in Table 24.2 for the given values of n₁ and n₂.

(vi) Decide as below:

Reject H₀ if the calculated value of R ≤ the smaller table value or R ≥ the larger table value; accept H₀ otherwise.
w are from two populations, 


Example 24.5. The two samples below are from two populations, assumed to be identical. Test the null hypothesis, using the Wilcoxon rank-sum test, that the two medians are equal against the alternative that the median of the first population is greater than that of the second population.

Sample 1 | 38, 49, 45, 29, 31, 35
Sample 2 | 31, 42, 22, 26, 43, 37, 25, 30, 47

(i) The null and alternative hypotheses are

H₀: M₁ = M₂ and H₁: M₁ > M₂,

where M₁ and M₂ are the medians of the first and second population respectively.

(ii) We choose the significance level at α = 0.05.

(iii) The test-statistic to use is R, the sum of the ranks of sample 1.

(iv) Computations. Arranging the observations of the combined samples in order of increasing magnitude and marking the observations from sample 1 with an asterisk (a good way to distinguish between the samples), we get

22, 25, 26, 29*, 30, 31*, 31, 35*, 37, 38*, 42, 43, 45*, 47, 49*.

Assigning the ranks to these observations (marking the ranks of the observations from sample 1), we get

1, 2, 3, 4*, 5, 6.5*, 6.5, 8*, 9, 10*, 11, 12, 13*, 14, 15*.

Adding the marked ranks, we get R = 56.5.

(v) Looking in Table 24.2, we find for n₁ = 6, n₂ = 9, the critical region which consists of all values of R ≥ 63 (one-tailed test, lower pair).

(vi) Conclusion. Since the computed value of R = 56.5 does not fall in the critical region, we cannot reject our null hypothesis.
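The ranking in step (iv) can be reproduced with a short Python sketch. The function name `rank_sum` is hypothetical, and the sixth value of the second sample is taken as 37, which is an assumption based on the ordered list of combined observations (it contains a 37 and only two 31s).

```python
def rank_sum(sample1, sample2):
    """Sum of the ranks of sample1 in the combined, ordered data.

    Tied observations receive the average of the ranks they occupy.
    """
    combined = sorted(sample1 + sample2)

    def avg_rank(v):
        first = combined.index(v) + 1      # rank of the first occurrence
        count = combined.count(v)          # number of tied observations
        return first + (count - 1) / 2     # average of the tied ranks

    return sum(avg_rank(v) for v in sample1)

sample1 = [38, 49, 45, 29, 31, 35]
sample2 = [31, 42, 22, 26, 43, 37, 25, 30, 47]   # sixth value assumed to be 37
R = rank_sum(sample1, sample2)
print(R)
```

With n₁ = 6 and n₂ = 9, the one-tailed critical region quoted in the example is R ≥ 63, so the computed R is compared with 63.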


Example 24.6. Given the two samples below, use the Wilcoxon rank-sum test to test the null hypothesis that the population medians are equal against the alternative that they are not equal. Let α = 0.05.

Sample 1 | 40, 35, 44, 42, 46, 28, 39, 50, 37, 45, 27, 35
Sample 2 | 32, 34, 49, 37, 48, 49, 48, 51, 45, 44, 34, 36, 50, 49, 37

(i) We state our null and alternative hypotheses as

H₀: M₁ = M₂ and H₁: M₁ ≠ M₂,

where M₁ and M₂ are the medians of the first and second population respectively.

(ii) The significance level is set at α = 0.05.

(iii) As both n₁ and n₂ exceed 10, the test-statistic to use is

$$Z = \frac{R - \mu_R}{\sigma_R}$$

which, if H₀ is true, is approximately standard normal.

(iv) Computations. Arranging the observations of the combined samples in order of increasing magnitude and marking (for identification) the observations from sample 1 with an asterisk, we get

27*, 28*, 32, 34, 34, 35*, 35*, 36, 37*, 37, 37, 39*, 40*, 42*, 44*, 44, 45*, 45, 46*, 48, 48, 49, 49, 49, 50*, 50, 51.

Replacing these observations with their ranks (with the ranks of observations from sample 1 marked), we get

1*, 2*, 3, 4.5, 4.5, 6.5*, 6.5*, 8, 10*, 10, 10, 12*, 13*, 14*, 15.5*, 15.5, 17.5*, 17.5, 19*, 20.5, 20.5, 23, 23, 23, 25.5*, 25.5, 27.

Adding the marked ranks, we get R = 142.5.

Now n₁ = 12 and n₂ = 15, therefore

$$\mu_R = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{12(12 + 15 + 1)}{2} = 168, \text{ and}$$

$$\sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(12)(15)(12 + 15 + 1)}{12}} = 20.49$$

$$z = \frac{R - \mu_R}{\sigma_R} = \frac{142.5 - 168}{20.49} = -1.24$$

(v) The critical region is |Z| ≥ 1.96.

(vi) Conclusion. Since the computed value of z = −1.24 does not fall in the critical region, we cannot reject our null hypothesis of equal medians.
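The large-sample calculation of this example can be verified with a few lines of Python. The function name `rank_sum_z` is an assumption made for this sketch; the inputs R = 142.5, n₁ = 12 and n₂ = 15 are the values of the example.

```python
from math import sqrt

def rank_sum_z(r, n1, n2):
    """Mean, standard deviation and Z for the rank sum R of the sample of size n1."""
    mean = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return mean, sd, (r - mean) / sd

mean, sd, z = rank_sum_z(142.5, 12, 15)
print(mean, round(sd, 2), round(z, 2))
reject = abs(z) >= 1.96      # two-tailed test at the 5% level
```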


24.5 THE MANN-WHITNEY U TEST

The Mann-Whitney U test is a nonparametric alternative to the Student's two-sample t test, which requires random sampling from normal populations with equal variances. The test is based on ranks and is used to determine whether or not two independent samples of size n₁ and n₂ come from populations having identical distributions. The null hypothesis to be tested is that the two populations are identical. To carry out the test, we arrange all the n₁ + n₂ observations of the combined samples in order of increasing magnitude and assign the ranks 1, 2, ..., n₁ + n₂ to them. In the case of ties, we assign the average of the tied ranks. We add the ranks assigned to observations in sample 1 and denote this sum by R₁. Similarly, we calculate the sum of ranks of sample 2 and denote it by R₂. We then find, for both samples, the values of U (the statistic used in this test) as below:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \text{ and}$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

We choose the smaller of the two values found for U₁ and U₂ as the U statistic for the Mann-Whitney test, i.e. U = min(U₁, U₂). We reject the null hypothesis if the calculated value of U is either ≤ the smaller value or is ≥ the larger value given in the table titled "Critical values of U for the Mann-Whitney test" (Table 24.3).

It is interesting to note that the sum of the two U's will always be equal to the product of the two sample sizes, i.e. U₁ + U₂ = n₁n₂, and the smaller value of U never exceeds n₁n₂/2. This provides a good check on the correctness of the rankings. In practice, we need to calculate either U₁ or U₂; the other can be obtained from the relation.

It has been shown that, when H₀ is true, and n₁ and n₂ are both greater than 8, the statistic U is approximately normally distributed with mean

$$\mu_U = \frac{n_1 n_2}{2}$$

and standard deviation

$$\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

The test-statistic would then become

$$Z = \frac{U - \mu_U}{\sigma_U}$$

which, if H₀ is true, is approximately standard normal. We accept or reject the hypothesis by applying the usual decision rule.

Table 24.3. Critical Values of U for the Mann-Whitney Test

In this table there are four numbers for each value of n₁ and n₂. The top pair are the critical values for a two-tailed test for α = 0.05, while the lower pair are the critical values for a one-tailed test for α = 0.05.

The Mann-Whitney U test is also applied to an ordered 2 × k contingency table in which two samples are classified into ranked categories or groups such as poor, fair, good, very good, or 41-60, 61-80, over 80, etc. In such a situation, we regard all observations in the same category or group as ties. We assign to each category or group the average rank given by

$$r_j = \frac{T_j + 1}{2} + C_j,$$

where

rⱼ = average rank of the jth category or group,
Tⱼ = total number of observations in the jth category or group, and
Cⱼ = cumulative total of the categories preceding the jth category.

We compute the sum of ranks for each sample by multiplying each average rank by the number of observations in that rank. The rest of the procedure is the same.

The following examples illustrate how the Mann-Whitney U test is used.

Example 24.7. Two plastics, each produced by a different process, were tested for ultimate strength. The measurements shown below represent breaking load in units of 1000 pounds per square inch.

Plastic 1 | 15.3, 18.7, 22.8, 17.6, 15.1, 14.8
Plastic 2 | 21.2, 22.4, 18.3, 19.3, 17.1, 211

Use the Mann-Whitney U test to test the hypothesis of no difference in the distributions of strengths for the two plastics at the α = 0.05 level of significance.

(i) We state our null and alternative hypotheses as

H₀: There is no difference in the distributions of strength, and

H₁: The two distributions of strength are different.

(ii) The significance level is set at α = 0.05.

(iii) The test-statistic to use is U, the smaller of the two values U₁ and U₂.

(iv) Computations. Arranging the observations of the combined samples in order of increasing magnitude and marking the observations from sample 1, we get

14.8, 15.1, 15.3, 17.1, 17.6, 18.3, 18.7, ...

Replacing these values with their ranks (with the ranks of observations from sample 1 marked with an asterisk), we get

1*, 2*, 3*, 4, 5*, 6, 7*, 8, 9, 10*, 11, 12.

Totalling the ranks of the two samples separately, we get

R₁ = 1 + 2 + 3 + 5 + 7 + 10 = 28, and
R₂ = 4 + 6 + 8 + 9 + 11 + 12 = 50.

$$\text{Now } U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 = (6)(6) + \frac{(6)(7)}{2} - 28 = 29, \text{ and}$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 = (6)(6) + \frac{(6)(7)}{2} - 50 = 7$$

The smaller of the two values for U₁ and U₂ is taken as the test-statistic, i.e. U = min[29, 7] = 7.

(v) Looking in Table 24.3, we find for n₁ = 6, n₂ = 6 and α = 0.05, the critical region consists of all values of U ≤ 5 and of all values of U ≥ 31.

(vi) Conclusion. Since the computed value of U = 7 does not fall in the critical region, we are unable to reject the hypothesis of no difference in the distributions of strength.
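The two U values and the check U₁ + U₂ = n₁n₂ can be sketched in Python as follows. The function name `mann_whitney_u` is hypothetical; the rank sums 28 and 50 and the sample sizes 6 and 6 are those of Example 24.7.

```python
def mann_whitney_u(r1, r2, n1, n2):
    """Return (U1, U2, U) from the rank sums R1, R2 of the two samples."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    # Check on the ranking: the two U values must add up to n1 * n2
    if u1 + u2 != n1 * n2:
        raise ValueError("rank sums are inconsistent with the sample sizes")
    return u1, u2, min(u1, u2)

u1, u2, u = mann_whitney_u(28, 50, 6, 6)
print(u1, u2, u)
```

Because of the relation U₁ + U₂ = n₁n₂, an error in the ranking usually shows up at once as a failed check.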

Example 24.8. Given below are the grade point averages received by two groups of students.

Group 1 | 3.1, 5.8, 6.4, 6.2, 3.8, 7.5, 5.3, 4.3, 5.9, 4.9
Group 2 | 9.0, 5.6, 6.3, 8.5, 4.6, 7.1, 5.5, 7.9, 6.8, 5.5, 8.9

Test by the Mann-Whitney U test the hypothesis that the two groups come from identical populations. Use a 0.05 significance level.

(i) We state our null and alternative hypotheses as

H₀: The two groups come from identical populations or, equivalently, the grade point averages are equal; and

H₁: The two populations are not identical.

(ii) The significance level is set at α = 0.05.

(iii) The test-statistic is U, which becomes

$$Z = \frac{U - \dfrac{n_1 n_2}{2}}{\sigma_U}$$

as n₁ and n₂ are both greater than 8; and Z is approximately standard normal.

(iv) Computations. Arranging the observations of the combined groups in order of increasing magnitude and marking the observations from Group 1 with an asterisk, we get

3.1*, 3.8*, 4.3*, 4.6, 4.9*, 5.3*, 5.5, 5.5, 5.6, 5.8*, 5.9*, 6.2*, 6.3, 6.4*, 6.8, 7.1, 7.5*, 7.9, 8.5, 8.9, 9.0.

The corresponding ranks are

1*, 2*, 3*, 4, 5*, 6*, 7.5, 7.5, 9, 10*, 11*, 12*, 13, 14*, 15, 16, 17*, 18, 19, 20, 21.

Totalling the ranks of the two groups separately, we get

R₁ = 1 + 2 + 3 + 5 + 6 + 10 + 11 + 12 + 14 + 17 = 81, and
R₂ = 4 + 7.5 + 7.5 + 9 + 13 + 15 + 16 + 18 + 19 + 20 + 21 = 150.

Now the value of the test-statistic U is the smaller of

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 = (10)(11) + \frac{(10)(11)}{2} - 81 = 84, \text{ and}$$

$$U_2 = n_1 n_2 - U_1 = 110 - 84 = 26$$

Thus U = min[84, 26] = 26. The mean and standard deviation of the sampling distribution of U are

$$\mu_U = \frac{n_1 n_2}{2} = \frac{(10)(11)}{2} = 55, \text{ and}$$

$$\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(10)(11)(10 + 11 + 1)}{12}} = 14.2$$

$$\text{Hence } z = \frac{U - \mu_U}{\sigma_U} = \frac{26 - 55}{14.2} = -2.04$$

(v) The critical region is |Z| ≥ z₀.₀₂₅ = 1.96.

(vi) Conclusion. Since the computed value of z = −2.04 falls in the critical region, we reject H₀ and conclude that the grade point averages differ significantly.
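The whole calculation of this example, from raw data to Z, can be sketched in Python. The function name `mann_whitney_z` is hypothetical, and one of the two 5.8 values printed for Group 1 is taken as 5.3, an assumption based on the ordered list and the rank total R₁ = 81.

```python
from math import sqrt

def mann_whitney_z(sample1, sample2):
    """Return (R1, U, Z) for two independent samples, using average ranks for ties."""
    combined = sorted(sample1 + sample2)

    def avg_rank(v):
        first = combined.index(v) + 1          # rank of first occurrence
        return first + (combined.count(v) - 1) / 2

    n1, n2 = len(sample1), len(sample2)
    r1 = sum(avg_rank(v) for v in sample1)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1                          # from U1 + U2 = n1 * n2
    u = min(u1, u2)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # no correction for ties, as in the text
    return r1, u, (u - mean) / sd

group1 = [3.1, 5.8, 6.4, 6.2, 3.8, 7.5, 5.3, 4.3, 5.9, 4.9]   # one 5.8 assumed to be 5.3
group2 = [9.0, 5.6, 6.3, 8.5, 4.6, 7.1, 5.5, 7.9, 6.8, 5.5, 8.9]
r1, u, z = mann_whitney_z(group1, group2)
print(r1, u, round(z, 2))
```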


Example 24.9. The following table shows the scores received by male and female students in a certain standard test:

[Table of scores for male and female students: values not legible in the source.]

(i) We state our null and alternative hypotheses as

H₀: μ₁ = μ₂, i.e. the mean scores of male (μ₁) and female (μ₂) students are the same; and

H₁: μ₁ ≠ μ₂.

(ii) The significance level is set at α = 0.05.

(iii) The test-statistic is U, which becomes

$$Z = \frac{U - \mu_U}{\sigma_U},$$

as n₁ and n₂ are both greater than 8, and Z is approximately standard normal.

(iv) Computations. Regarding all observations in each category or group as ties, we determine their average ranks. We compute the sum of ranks for the sample of male students by multiplying each average rank by the number of observations in that rank. These computations are shown below:

Number of male students     |    3    |    8    |    17    |    12    |    6
Average rank                |   2.5   |   12    |   40.5   |   69.5   |   82
Total of ranks for males    | 2.5×3   | 12×8    | 40.5×17  | 69.5×12  | 82×6
                            | = 7.5   | = 96    | = 688.5  | = 834    | = 492

so that R₁ = 7.5 + 96 + 688.5 + 834 + 492 = 2118.

$$\text{Now } U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 = (46)(40) + \frac{(46)(47)}{2} - 2118 = 1840 + 1081 - 2118 = 803, \text{ and}$$

$$U_2 = n_1 n_2 - U_1 = 1840 - 803 = 1037$$

U = min[U₁, U₂] = 803.

The mean and standard deviation of the sampling distribution of U are

$$\mu_U = \frac{n_1 n_2}{2} = \frac{(46)(40)}{2} = 920, \text{ and}$$

$$\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(46)(40)(46 + 40 + 1)}{12}} = 115.5$$

$$z = \frac{U - \mu_U}{\sigma_U} = \frac{803 - 920}{115.5} = -1.01$$

(v) The critical region is |Z| ≥ z₀.₀₂₅ = 1.96.

(vi) Conclusion. Since the computed value of z = −1.01 does not fall in the critical region, we cannot reject H₀. We conclude that the mean scores of male and female students are the same.
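The average-rank formula for ordered categories, rⱼ = (Tⱼ + 1)/2 + Cⱼ, can be sketched in Python. The function name is hypothetical, and the category totals 4, 15, 42, 16, 9 are assumed values, chosen only because they reproduce the average ranks 2.5, 12, 40.5, 69.5 and 82 used in Example 24.9; the male counts 3, 8, 17, 12, 6 are those of the example.

```python
def category_average_ranks(totals):
    """Average rank of each ordered category: r_j = (T_j + 1) / 2 + C_j."""
    ranks, cumulative = [], 0
    for t in totals:
        ranks.append((t + 1) / 2 + cumulative)
        cumulative += t            # C_j for the next category
    return ranks

totals = [4, 15, 42, 16, 9]        # assumed category totals (both samples together)
males = [3, 8, 17, 12, 6]          # male students in each category

ranks = category_average_ranks(totals)
r1 = sum(r * m for r, m in zip(ranks, males))   # rank sum for the male sample
print(ranks, r1)
```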
THE MEDIAN TEST (TWO OR MORE SAMPLES) 


The median test is a nonparametric test used to determine whether two (or more) independent random samples, which may be of unequal size, are drawn from populations with the same medians. The null hypothesis, in case of two samples, may be formulated as $H_0$: Median$_1$ = Median$_2$, i.e., the two populations have the same medians. The test is based on the principle that about half of each sample's observations will be above and about half will be below the median, when $H_0$ is true. To perform the test, we combine and arrange the sample observations in order of increasing magnitude and find the median of the combined observations. For each sample, we count the number of observations that are above or below the median of the combined data, ignoring the observations that are equal to median. We set out the numbers above and below median for each sample in a 2×2 (or a 2×k, in case of k samples) contingency table like the following:

             | Sample 1 | Sample 2 | Total
Above median | a        | b        | a + b
Below median | c        | d        | c + d
Total        | a + c    | b + d    | n

The rest of the procedure is the same as that of the chi-square test of independence in contingency tables. Fisher's exact probability formula should be used when the total number of observations in a 2×2 table is not sufficient. The median test is easy and simple to apply, but it is less ...
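The counting procedure and the 2×2 chi-square statistic can be sketched as follows. This is a minimal illustration, not the book's own code: the function names and the two small samples are hypothetical, and the statistic is the uncorrected form $n(bc - ad)^2 / [(a+c)(b+d)(a+b)(c+d)]$.

```python
from statistics import median

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table with cells a, b (top) and c, d (bottom)."""
    n = a + b + c + d
    return n * (b * c - a * d) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

def median_test_2x2(sample1, sample2):
    """Count observations above/below the combined median, ignoring values equal to it."""
    m = median(list(sample1) + list(sample2))
    a = sum(x > m for x in sample1)   # sample 1, above median
    b = sum(x > m for x in sample2)   # sample 2, above median
    c = sum(x < m for x in sample1)   # sample 1, below median
    d = sum(x < m for x in sample2)   # sample 2, below median
    return m, (a, b, c, d), chi_square_2x2(a, b, c, d)

# Hypothetical data, chosen only to illustrate the steps.
s1 = [12, 15, 18, 21, 25, 30]
s2 = [8, 10, 11, 14, 16, 19, 22]
m, counts, chi2 = median_test_2x2(s1, s2)
print(m, counts, round(chi2, 2))
```

The computed value is compared with the chi-square critical value for 1 degree of freedom (3.84 at the 0.05 level).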


Example 24.10. An identical examination was given to two classes of students. One class contains 13 students and the other contains 17 students. The test scores of the 30 students are:

Class I  | 54, 65, 66, 71, 73, 78, 78, 80, 82, 87, 92, ...
Class II | 51, 53, 54, 61, 64, 66, 67, 69, 71, 74, 76, 80, 81, 85, ...

At $\alpha = 0.05$, use the median test to test the hypothesis that the two classes of students are from populations with an identical median.

(i) We formulate our null and alternative hypotheses as
$H_0$: The median scores for class I are equal to the median scores for class II, and
$H_1$: The median scores for class I differ from the median scores for class II.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is the chi-square statistic with 1 degree of freedom.

(iv) Computations. Arranging the combined observations in order of increasing magnitude and locating the median of the combined data, we find that median = 75. Counting the observations that are above or below 75 in each class, we find that there are 8 observations above and 5 observations below the median in class I; and there are 7 observations above and 10 observations below the median in class II. This information produces the following 2×2 contingency table:

             | Class I | Class II | Total
Above median | 8       | 7        | 15
Below median | 5       | 10       | 15
Total        | 13      | 17       | 30

Now $$\chi^2 = \frac{n(bc - ad)^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{(30)[(7)(5) - (8)(10)]^2}{(13)(17)(15)(15)} = 1.22$$

(v) The critical region is $\chi^2 \ge \chi^2_{0.05(1)} = 3.84$.

(vi) Conclusion. Since the calculated value of $\chi^2 = 1.22$ does not fall in the critical region, we cannot reject the null hypothesis. We may conclude that the two classes of students are from populations with an identical median.

24.7 THE RUNS TEST FOR RANDOMNESS

A run is a sequence of one or more identical symbols (or letters) that are followed and preceded by different symbols (or letters) or by no symbols at all. For example, in the following sequence of plus and minus signs

+ +   - - -   ...
run 1  run 2  run 3  run 4  run 5  run 6

the number of runs is 6. The first run consists of 2 plus signs (i.e., + +) as it is preceded by no symbol and is followed by a symbol of different kind. The second run contains 3 minus signs (i.e., - - -) as it is preceded and followed by different symbols, and so on. Similarly, the following sequence of 6 Heads and 5 Tails

H T T H H H T H H

has 5 runs. Too few runs or too many runs in a sequence present evidence of non-randomness.

The runs test may be used to test the following two types of hypotheses:

(i) That the observations have been drawn at random from a single population (i.e., the sequence of observations is random).
(ii) That two random samples come from populations having identical distributions (implying that they have equal medians or means).

(i) One Sample. To carry out the test, we find the sample median and replace each observation by a plus or a minus sign according as the observation is above or below the sample median, ignoring the observations that are equal to median. We thus get a sequence of plus and minus signs. We denote the number of symbols of one kind by $n_1$, the number of symbols of the other kind by $n_2$ and the number of runs by $n_r$. We reject our null hypothesis if the observed value of $n_r$ is less than or equal to the smaller number or is greater than or equal to the larger number given in Table 24.4, for the given values of $n_1$ and $n_2$ at $\alpha = 0.05$.

It has been shown that, when both $n_1$ and $n_2$ are greater than 10, $n_r$ is approximately normally distributed with mean

$$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

and standard deviation

$$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}.$$

The test statistic then becomes

$$Z = \frac{n_r - \mu_r}{\sigma_r},$$

which, if $H_0$ is true, is approximately standard normal. We apply the usual decision rules to reject or accept the null hypothesis.
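The run count and the large-sample statistic can be sketched in a few lines. The function names are hypothetical, and the sequence of A's and B's is only an illustrative input with 12 symbols of one kind and 10 of the other.

```python
import math

def count_runs(seq):
    """Number of runs: 1 plus the number of places where the symbol changes."""
    if not seq:
        return 0
    return 1 + sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)

def runs_z(n1, n2, n_r):
    """Mean, standard deviation and z-value of the number of runs n_r."""
    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    var_r = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    sd_r = math.sqrt(var_r)
    return mean_r, sd_r, (n_r - mean_r) / sd_r

seq = "BAAAAAABABAABBBABBBABA"   # illustrative sequence
n1, n2 = seq.count("A"), seq.count("B")
n_r = count_runs(seq)
mean_r, sd_r, z = runs_z(n1, n2, n_r)
print(n1, n2, n_r, round(mean_r, 2), round(sd_r, 2), round(z, 2))
```

With $\alpha = 0.05$, the hypothesis of randomness would be rejected when |z| is at least 1.96.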


Table 24.4. Critical Values of $n_r$ in the Runs Test

In this table, there are two numbers for each value ...

[The body of Table 24.4 is not legible in this copy.]


(ii) Two Samples (The Wald-Wolfowitz runs test). To perform the test, we write down the $n_1 + n_2$ observations from the two samples and arrange them in one sequence according to their magnitude. We write down the letter A for each observation from sample 1 and the letter B for each observation from sample 2, thus getting a sequence of A's and B's. We count the number of runs in the whole sequence. If both $n_1$ and $n_2$ do not exceed 10, then any observed value of $n_r$ that is less than or equal to the smaller number, or greater than or equal to the larger number, in Table 24.4 is in the critical region for $\alpha = 0.05$. If both $n_1$ and $n_2$ are greater than 10, we use the statistic

$$Z = \frac{n_r - \mu_r}{\sigma_r} \quad \left(\text{where } \mu_r = 1 + \frac{2 n_1 n_2}{n_1 + n_2} \text{ and } \sigma_r \text{ is as before}\right),$$

which, under $H_0$, is approximately standard normal.

The following examples illustrate how the runs tests for randomness are performed.


Example 24.11. Each day a sample of 10 production items was taken and the mean weight computed. Following are the daily means:

13.0 12.8 12.9 13.0 13.1 12.9 12.6 12.5 12.7 12.8
13.1 13.1 13.2 13.3 13.2 13.1 12.9 13.2 13.3 13.4

Is the number of runs below and above the median significant at the 0.05 level?

(i) We state our null and alternative hypotheses as
$H_0$: The plus and minus signs occur in a random manner, and
$H_1$: The sequence of signs is not in random order.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic is $n_r$, the number of runs in one sample.

(iv) Computations. Arranging the observations in order of increasing magnitude, we find the median to be 13.05. Now, replacing each observation by a plus or a minus sign depending on whether the observation is above or below the median, we get the following sequence:

- - - - + - - - - - + + + + + + - + + +

This gives $n_1 = 10$, $n_2 = 10$ and $n_r = 6$.

(v) Looking in Table 24.4 for $n_1 = 10$, $n_2 = 10$ and $\alpha = 0.05$, the critical region consists of all values of $n_r \le 6$ and all values of $n_r \ge 16$.

(vi) Conclusion. Since the observed value of $n_r = 6$ falls in the critical region, we reject $H_0$ and may conclude that the number of runs below and above the median is significant at the 0.05 level.
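The conversion of the daily means into signs about the median can be sketched as below. The function names are hypothetical; the data are the twenty daily means of this example, and the critical values 6 and 16 are those quoted from Table 24.4 in step (v).

```python
from statistics import median

def signs_about_median(values):
    """Replace each value by '+' or '-' relative to the median, dropping ties."""
    m = median(values)
    signs = "".join("+" if v > m else "-" for v in values if v != m)
    return m, signs

def count_runs(seq):
    """Number of runs: 1 plus the number of places where the symbol changes."""
    if not seq:
        return 0
    return 1 + sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)

daily_means = [13.0, 12.8, 12.9, 13.0, 13.1, 12.9, 12.6, 12.5, 12.7, 12.8,
               13.1, 13.1, 13.2, 13.3, 13.2, 13.1, 12.9, 13.2, 13.3, 13.4]

m, signs = signs_about_median(daily_means)
n_plus, n_minus = signs.count("+"), signs.count("-")
n_r = count_runs(signs)

# Critical values for n1 = n2 = 10 at the 0.05 level, as quoted in step (v).
lower, upper = 6, 16
reject = n_r <= lower or n_r >= upper
print(round(m, 2), signs, n_plus, n_minus, n_r, reject)
```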


Example 24.12. The following data represent the outside diameters of washers produced by two different production lines:

Line A | ..., 1.57, 1.84, 1.90, ...
Line B | 1.65, 1.69, 1.72, 1.91, 1.74, 1.75, 1.55, 1.86, 1.87, 1.88

At the $\alpha = 0.05$ level, perform a runs test to test the hypothesis that the two random samples come from populations having the same distributions. (CU; M.Sc., 1993)

(i) We state our null and alternative hypotheses as
$H_0$: The two random samples come from populations having the same distributions, and
$H_1$: The two samples come from populations having different distributions.

(ii) The significance level is set at $\alpha = 0.05$.


(iii) The test-statistic is $n_r$, the number of runs in two samples. (In other words, the Wald-Wolfowitz runs test is to be used.)

(iv) Computations. Arranging the observations of the two samples in one sequence according to their magnitude and replacing each observation from sample 1 by the letter A and each observation from sample 2 by the letter B, we get the following sequence of A's and B's:

B A A A A A A B A B A A B B B A B B B A B A

This gives $n_1 = 12$, $n_2 = 10$ and $n_r = 12$; and we compute

$$\mu_r = 1 + \frac{2 n_1 n_2}{n_1 + n_2} = 1 + \frac{2(12)(10)}{12 + 10} = 11.91,$$ and

$$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} = \sqrt{\frac{2(12)(10)[2(12)(10) - 12 - 10]}{(12 + 10)^2 (12 + 10 - 1)}} = \sqrt{\frac{(240)(218)}{(484)(21)}} = 2.27$$

$$\therefore z = \frac{12 - 11.91}{2.27} = 0.04$$

(v) The critical region is $|Z| \ge z_{0.025} = 1.96$.

(vi) Conclusion. Since the computed value of z = 0.04 does not fall in the critical region, we accept $H_0$ and conclude that the two random samples come from populations having identical distributions. In other words, we may state that the two lines are producing equivalent product.

24.8 THE KOLMOGOROV-SMIRNOV TESTS

The Kolmogorov-Smirnov tests are useful nonparametric tests. The one-sample test was proposed in 1933 by Kolmogorov, while the two-sample test was developed by Smirnov in 1939. Because of similarities between the two tests (both are based on cumulative distribution functions or cumulative relative frequency distributions), both names are associated with the one-sample and the two-sample tests.

24.8.1 The Kolmogorov-Smirnov one-sample test is a nonparametric alternative to the chi-square goodness-of-fit test. The test compares the cumulative distribution function based on sample observations with some specified population distribution from which the sample is assumed to have been drawn. The hypothesis to be tested is that the sample comes from a completely specified distribution.

Let $S_n(X)$ denote the cumulative distribution function based on a sample of n observations, that is, $S_n(X) = k/n$, where k is the number of observations less than or equal to X, and let $F_0(X)$ be the specified cumulative distribution function. Then, if the null hypothesis is true, for every value of X, $S_n(X)$ is expected to be fairly close to $F_0(X)$. Large differences between them, $|S_n(X) - F_0(X)|$, provide evidence for rejecting the null hypothesis. The test is therefore based on the maximum absolute difference $D_n$, defined by

$$D_n = \max |S_n(X) - F_0(X)|.$$

The null hypothesis is rejected when $D_n$ exceeds the critical value for the chosen level of significance. The critical values are obtained from special tables (see Table A-22, Steel and Torrie, second edition).

The Kolmogorov-Smirnov one-sample test is also used for testing hypotheses about discrete distributions. The test is more powerful than the chi-square test.

24.8.2 The Kolmogorov-Smirnov Two-Sample Test deals with the agreement between two sample cumulative distributions. The null hypothesis to be tested is that two independent samples come from identically distributed populations and the alternative hypothesis is that they come from populations having different distribution functions. In symbols, the hypotheses to be tested are

$H_0: F_1(X) = F_2(X)$ for all X, and
$H_1: F_1(X) \ne F_2(X)$ for at least one X.

To test the hypotheses, we compare the sample cumulative relative frequency distributions at each sample value (after ranking all the observations together in case of ungrouped data). The cumulative distributions of the two samples are expected to be fairly close to each other if $H_0$ is true. A large difference between them at any point is therefore evidence for rejecting the null hypothesis. Let $S_{n_1}(X)$ and $S_{n_2}(X)$ denote the cumulative relative frequency distributions of two independent samples of sizes $n_1$ and $n_2$ respectively. Then the Kolmogorov-Smirnov two-sample test is based on the maximum difference D, defined by

$$D = \max [S_{n_1}(X) - S_{n_2}(X)], \text{ for a one-tailed test, and}$$

$$D = \max |S_{n_1}(X) - S_{n_2}(X)|, \text{ for a two-tailed test.}$$
In case of small samples ($n_1$ and $n_2$ less than 40) and a two-tailed test, the null hypothesis is rejected when D exceeds the critical value for the chosen level of significance. The critical values are obtained from special tables (see Steel and Torrie, 2nd edition).

When both sample sizes $n_1$ and $n_2$ are greater than 40, for a one-tailed test, the test-statistic to use is

$$\chi^2 = 4D^2 \, \frac{n_1 n_2}{n_1 + n_2},$$

which has approximately a chi-square distribution with 2 d.f.

The null hypothesis is rejected if $\chi^2 \ge \chi^2_{0.05(2)} = 5.99$.

The test-statistic based on chi-square approximation may also be used with small samples.

We illustrate the procedure with the following examples:

Example 24.13. The following scores were obtained by rolling a six-sided die 10 times: 3, 4, 4, 2, 6, 6, 3, 4, 2, 5.
Use the Kolmogorov-Smirnov statistic to test at $\alpha = 0.05$, the hypothesis that it is a sample from a uniform distribution of integer values between 1 and 6 inclusive.

(i) We state our null and alternative hypotheses as
$H_0$: the population distribution is uniform, and
$H_1$: the population distribution is not uniform.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic is the Kolmogorov-Smirnov one-sample D statistic.

(iv) Computations. Arranging the sample observations in a cumulative frequency distribution, we get the table below. Let $S_{10}(X)$ denote the cumulative relative frequency distribution of 10 sample observations, and let $F_0(X)$ denote the cumulative probability distribution under $H_0$, where each integer would occur with probability $\frac{1}{6}$. Then

Scores X | $S_{10}(X)$  | $F_0(X)$   | $|S_{10}(X) - F_0(X)|$
1        | 0/10 = 0.00  | 1/6 = 0.17 | 0.17
2        | 2/10 = 0.20  | 2/6 = 0.33 | 0.13
3        | 4/10 = 0.40  | 3/6 = 0.50 | 0.10
4        | 7/10 = 0.70  | 4/6 = 0.67 | 0.03
5        | 8/10 = 0.80  | 5/6 = 0.83 | 0.03
6        | 10/10 = 1.00 | 6/6 = 1.00 | 0.00

Now $D = \max |S_{10}(X) - F_0(X)| = 0.17$.

(v) The critical value of D for n = 10 and $\alpha = 0.05$ from the table of "critical values for Kolmogorov-Smirnov one-sample test" is 0.409.

(vi) Conclusion. Since the observed value of D = 0.17 does not exceed the table value, we cannot reject our null hypothesis, which is that the observed scores are uniformly distributed.

Example 24.14. The two independent samples given below are from two populations.

Sample 1 | 38, 49, 45, 29, 31, 35
Sample 2 | 31, 42, 22, 26, 43, 37, 25, 30, 47

Apply the Kolmogorov-Smirnov test to test the null hypothesis that the two samples come from populations having identical distributions. Use a 0.05 significance level. (I.U., M.Sc. 1995)

(i) We state our null and alternative hypotheses as
$H_0$: The two samples come from populations having identical distributions, or equivalently $F_1(X) = F_2(X)$, and
$H_1$: The two samples come from different population distributions, i.e. $F_1(X) \ne F_2(X)$.

(ii) The significance level is set at $\alpha = 0.05$.

(iii) The test-statistic to use is the Kolmogorov-Smirnov two-sample D statistic.

(iv) Computations. We arrange all the observations together in order of magnitude. We compute the cumulative relative frequencies at each sample value and find the differences between them at each listed point. The ordered sample values, the corresponding values of $S_6(X_1)$ and $S_9(X_2)$, and the differences $|S_6(X_1) - S_9(X_2)|$ are given below; the maximum difference D is computed as follows:

Ordered observations | $S_6(X_1)$ | $S_9(X_2)$ | $|S_6(X_1) - S_9(X_2)|$
22                   | 0          | 1/9        | |0 - 1/9| = 1/9
25                   | 0          | 2/9        | |0 - 2/9| = 2/9
26                   | 0          | 3/9        | |0 - 3/9| = 3/9 = D
29                   | 1/6        | 3/9        | |1/6 - 3/9| = 3/18
30                   | 1/6        | 4/9        | |1/6 - 4/9| = 5/18
31                   | 2/6        | 5/9        | |2/6 - 5/9| = 4/18
35                   | 3/6        | 5/9        | |3/6 - 5/9| = 1/18
37                   | 3/6        | 6/9        | |3/6 - 6/9| = 3/18
38                   | 4/6        | 6/9        | |4/6 - 6/9| = 0
42                   | 4/6        | 7/9        | |4/6 - 7/9| = 2/18
43                   | 4/6        | 8/9        | |4/6 - 8/9| = 4/18
45                   | 5/6        | 8/9        | |5/6 - 8/9| = 1/18
47                   | 5/6        | 1          | |5/6 - 1| = 1/6
49                   | 1          | 1          | |1 - 1| = 0

Thus D = 3/9 = 6/18, with $n_1 = 6$ and $n_2 = 9$.

(v) The critical value of D for $n_1 = 6$ and $n_2 = 9$ from the tables of "Critical Values for Kolmogorov-Smirnov two-sample test" at $\alpha = 0.05$ is 2/3 = 12/18.

(vi) Conclusion. As the calculated value of D does not exceed the table value, we cannot reject $H_0$. We conclude that the two samples come from populations having identical distributions.
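Both D statistics from Examples 24.13 and 24.14 can be checked with a short sketch using exact fractions. The function names are hypothetical. The die scores are entered with the seventh roll read as 3, which is an assumption about an illegible digit in the printed data. For a discrete hypothesised distribution it is enough to compare the two cumulative distributions at the support points, as done here; a continuous $F_0$ would need a more careful comparison on both sides of each jump.

```python
from fractions import Fraction

def ks_one_sample(sample, cdf, support):
    """D_n = max |S_n(x) - F_0(x)| over the listed support points."""
    n = len(sample)
    return max(abs(Fraction(sum(x <= v for x in sample), n) - cdf(v))
               for v in support)

def ks_two_sample(sample1, sample2):
    """Two-tailed D = max |S_n1(x) - S_n2(x)| over all observed values."""
    n1, n2 = len(sample1), len(sample2)
    points = sorted(set(sample1) | set(sample2))
    return max(abs(Fraction(sum(x <= v for x in sample1), n1)
                   - Fraction(sum(x <= v for x in sample2), n2))
               for v in points)

# Example 24.13: die scores (seventh score assumed to be 3), uniform on 1..6.
die = [3, 4, 4, 2, 6, 6, 3, 4, 2, 5]
d_one = ks_one_sample(die, lambda v: Fraction(v, 6), range(1, 7))

# Example 24.14: two independent samples.
s1 = [38, 49, 45, 29, 31, 35]
s2 = [31, 42, 22, 26, 43, 37, 25, 30, 47]
d_two = ks_two_sample(s1, s2)

print(d_one, float(d_one), d_two)
```

The resulting values are then compared with the tabulated critical values (0.409 and 2/3 in the two examples).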


24.9 THE KRUSKAL-WALLIS H TEST 


A nonparametric alternative procedure to a one-way analysis of variance is the Kruskal-Wallis H test. This test is a generalisation of the two-sample Mann-Whitney U test. It tests the null hypothesis $H_0$ that k independent samples come from populations with equal means (or from the same population) against the alternative $H_1$ that the populations differ. It is an extremely useful test when the assumptions of normally distributed populations and equality of variances are not satisfied. The test was introduced in 1952 by William H. Kruskal and W. Allen Wallis.

Suppose we have k independent samples of sizes $n_1, n_2, \ldots, n_k$. We denote the observations of the ith sample by $X_{i1}, X_{i2}, \ldots, X_{in_i}$. To perform the test, we arrange all the $n = \sum n_i$ observations of the k samples in increasing order of magnitude and assign the ranks. In the case of ties, we assign the average of the ranks that would be assigned if there were no ties. To distinguish the sample values, we write the letters A, B, C, etc. for the observations of the first sample, the second sample, the third sample, etc. respectively against the ordered observations or in a Tally Column. We then replace the observations of k samples with their corresponding ranks. We sum the ranks for each sample and denote the sums by $R_1, R_2, \ldots, R_k$. We compute

$$S_r^2 = \sum r_{ij}^2, \text{ where } r_{ij} \text{ is the rank assigned to observation } X_{ij}.$$

If there are no ties, then

$$S_r^2 = \frac{n(n+1)(2n+1)}{6}.$$

The Kruskal-Wallis statistic H is given by

$$H = \frac{(n-1)\left[\sum \dfrac{R_i^2}{n_i} - C\right]}{S_r^2 - C},$$

where C denotes the appropriate correction term and is given by

$$C = \frac{n(n+1)^2}{4}.$$

In the case of no ties, the statistic H simplifies to

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1).$$

This formula may be used when there are a few ties. A general rule of thumb is that not more than 25% of observations be in ties.


We reject the null hypothesis $H_0$: All k populations have identical distributions, for large values of H. In case there are only three samples and each sample is fairly small (5 or fewer), we determine the significance of H by reference to the Kruskal and Wallis' Table, which gives critical values for all combinations of the $n_i$'s up to 5, 5, 5.

When at least one of the samples has more than 5 observations or there are more than 5 observations in each sample and $H_0$ is true, the sampling distribution of statistic H is approximately a chi-square distribution with (k-1) degrees of freedom. We reject $H_0$ if the computed value of $H \ge \chi^2_{\alpha,(k-1)}$ at the $\alpha$ level of significance.

The following example illustrates the procedure:


Example 24.15. The following data represent the operating times in hours for three types of scientific pocket calculators before a recharge is required:

Calculator | Operating times (in hours)
A          | 4.9, 6.1, 4.3, 4.6, 5.3
B          | 5.5, 5.4, 6.2, 5.8, 5.5, 5.2, 4.8
C          | 6.4, 6.8, 5.6, 6.5, 6.3, 6.6

Use the Kruskal-Wallis test, at the 0.01 level of significance, to test the hypothesis that the operating times for all three calculators are equal. (P.U., M.Sc. 1991, 9)

(i) We state our null and alternative hypotheses as
$H_0$: The mean operating times for all three calculators are equal, i.e. $\mu_1 = \mu_2 = \mu_3$; and
$H_1$: The mean operating times for at least two of the three calculators are not equal.

(ii) The significance level is set at $\alpha = 0.01$.

(iii) The test-statistic to use is the Kruskal-Wallis H statistic.

(iv) Computations. Arranging the observations of the three samples in increasing order of magnitude, assigning ranks and writing the letters A, B and C to distinguish sample values, we get

Ordered observations | Sample | Rank
4.3                  | A      | 1
4.6                  | A      | 2
4.8                  | B      | 3
4.9                  | A      | 4
5.2                  | B      | 5
5.3                  | A      | 6
5.4                  | B      | 7
5.5                  | B      | 8.5
5.5                  | B      | 8.5
5.6                  | C      | 10
5.8                  | B      | 11
6.1                  | A      | 12
6.2                  | B      | 13
6.3                  | C      | 14
6.4                  | C      | 15
6.5                  | C      | 16
6.6                  | C      | 17
6.8                  | C      | 18

For each sample, we replace each observation with its corresponding rank as below:

A | 4, 12, 1, 2, 6               | $R_1 = 25$
B | 8.5, 7, 13, 11, 8.5, 5, 3    | $R_2 = 56$
C | 15, 18, 10, 16, 14, 17       | $R_3 = 90$

Now $$\sum \frac{R_i^2}{n_i} = \frac{(25)^2}{5} + \frac{(56)^2}{7} + \frac{(90)^2}{6} = 1923,$$

$$S_r^2 = \sum r_{ij}^2 = (4)^2 + (12)^2 + (1)^2 + \ldots + (17)^2 = 2108.5,$$

$$C = \frac{n(n+1)^2}{4} = \frac{18(18+1)^2}{4} = 1624.5.$$

$$\therefore H = \frac{(n-1)\left[\sum \dfrac{R_i^2}{n_i} - C\right]}{S_r^2 - C} = \frac{(17)(1923 - 1624.5)}{2108.5 - 1624.5} = \frac{(17)(298.5)}{484} = 10.48$$

OR Since there is only one tie, we can calculate the value of H by applying the formula

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1).$$

Thus $$H = \frac{12(1923)}{(18)(19)} - 3(19) = 67.47 - 57 = 10.47$$

(v) The critical region is $H \ge \chi^2_{0.01(2)} = 9.21$.

(vi) Conclusion. Since the calculated value of H = 10.48 (or 10.47) falls in the critical region, we reject $H_0$ and conclude that the mean operating times for all three calculators are not equal.
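The ranking and both forms of H can be reproduced with a small sketch. The function names are hypothetical; the code follows the formulas of Section 24.9, with average ranks for ties, and uses the data of Example 24.15.

```python
def average_ranks(values):
    """Ranks 1..n in increasing order, with tied values sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis(samples):
    """Return (rank sums, H with tie correction, H by the no-ties formula)."""
    pooled = [x for s in samples for x in s]
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sums, start = [], 0
    for s in samples:
        rank_sums.append(sum(ranks[start:start + len(s)]))
        start += len(s)
    sum_r2_over_n = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    s_r2 = sum(r * r for r in ranks)   # sum of squared ranks
    c = n * (n + 1) ** 2 / 4           # correction term C
    h_ties = (n - 1) * (sum_r2_over_n - c) / (s_r2 - c)
    h_simple = 12 / (n * (n + 1)) * sum_r2_over_n - 3 * (n + 1)
    return rank_sums, h_ties, h_simple

a = [4.9, 6.1, 4.3, 4.6, 5.3]
b = [5.5, 5.4, 6.2, 5.8, 5.5, 5.2, 4.8]
c = [6.4, 6.8, 5.6, 6.5, 6.3, 6.6]
rank_sums, h_ties, h_simple = kruskal_wallis([a, b, c])
print(rank_sums, round(h_ties, 2), round(h_simple, 2))
```

Both values exceed 9.21, the chi-square critical value with 2 degrees of freedom at the 0.01 level, in agreement with the conclusion above.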


EXERCISES

24.1 What do you understand by nonparametric tests? Why are nonparametric tests also called distribution-free tests? Give the advantages and disadvantages of nonparametric tests over parametric tests. (P.U., M.Sc. 198...)

24.2 (a) How do nonparametric tests differ from parametric tests? Discuss the advantages and disadvantages of nonparametric tests.

(b) Describe the Sign test. When is it most appropriately used? Explain the difference between the Wilcoxon signed-rank test and the Sign test. (I.U., M.Sc. 19...)

24.3 (a) Describe the Wilcoxon signed-rank test for one sample. How does it differ from the sign test?

(b) Test the null hypothesis that the median of the population from which the data below have been obtained, equals 5... against the alternative that it is less. Use (i) the sign test, (ii) the Wilcoxon signed-rank test.

48, 51, 49, 53, 61, 59, 45, 52, 65, 47, 58, 57,
65, 56, 45, 49, 54, 63, 46, 57, 54, 58, 52, 45.
(I.U., M.Sc. 1987, 91, 9...)

24.4 The following data represent the number of hours that a rechargeable hedge trimmer operates before a recharge is required: 1.5, 2.2, 0.9, 1.3, 2.0, 1.6, 1.8, 1.5, 2.0, 1.2 and 1.7. Use (i) the sign test, (ii) the Wilcoxon signed-rank test to test the hypothesis at the 0.05 level of significance that this particular trimmer operates, on the average, 1.8 hours before requiring a recharge.

24.5 A sample of size 8 was chosen from a population. The sample observations are given below:
2.55, 4.62, 2.93, 2.46, 1.95, 4.55, 3.11, and 0.90.
Using (i) the sign test and (ii) the Wilcoxon signed-rank test, test the hypothesis that the median of the population is ... against the alternative that it does not.

24.6 (a) Compare the Wilcoxon signed-rank test ... matched pairs. Also explain the rationale of the Wilcoxon signed-rank test.

(b) Two varieties of tomato were grown in an experiment concerning their fruit-producing abilities, measured in pounds. The following data were obtained:

Variety A | 3.03, 3.10, 2.35, 3.86, 3.91, 1.72, 2.65, 2.30, 2.70, 3.60
Variety B | 2.28, 3.63, 2.17, 3.56, 3.73, 1.85, 1.48, 1.86, 2.76, 2.68

Apply (i) the sign test, (ii) the Wilcoxon signed-rank test, at the 0.05 level of significance, to test the hypothesis that there is no difference in fruit-producing abilities of the two varieties.

24.7 It is suspected that high school graduates who plan to major in mathematics in college would score more than five points higher in a natural sciences test than in a social sciences test. The following are the test scores of 20 such students:

[The table of scores is not legible in this copy.]

Test the null hypothesis that the natural sciences test scores (X) are five points higher than the social sciences test scores (Y) against the alternative that the former are higher by more than five points at $\alpha = 0.01$ by applying (i) the sign test, (ii) the Wilcoxon signed-rank test.

24.8 Explain fully the following nonparametric tests:
(a) The Wilcoxon rank-sum test.
(b) The Mann-Whitney U test.
(c) The Median test.
(d) The Runs test for randomness.

24.9 Five samples of each of two types of paint are scored as follows:

Paint I  | ...
Paint II | ...

Analyse this with the Wilcoxon rank-sum test.


24.10 Given the two samples below, test the null hypothesis that the population medians are equal against the alternative ... at $\alpha = 0.05$ by applying the Wilcoxon rank-sum test.

26, 25, 38, 33, 42, 40, 44, 26, 25, 43, 35, 48, 37
44, 30, 34, 47, 35, 46, 35, 47, 48, 34, 32, 42, 43, 49, 46, 47

24.11 (a) A taxi company tests two types of tyres on the ten cars of its fleet. The length of life of the tyres, in thousands of kilometers, is

Type A | 29, 27, 23, 30
Type B | 24, 37, 35, 19, 40, 31

Use a Mann-Whitney U test to test if there is any difference in length of life of the two types of tyres.

(b) Given below are the scores obtained by two groups of trainees:

Group I  | 31, 28, 42, 36, 29, 51, 34, 25, 44, 33, 49
Group II | 27, 45, 30, 53, 41, 39, 48, 43, 26, 37

Use a Mann-Whitney test to test if the scores for the two groups differ significantly. (I.U., M.Sc. 1998)

24.12 Given below are the ages of 29 executives of a certain company:

26, 25, 38, 33, 42, 40, 44, 26, 25, 43, 35, 48, 37
44, 30, 34, 47, 35, 46, 35, 47, 48, 34, 32, 42, 43, 49, 46, 47

Test the hypothesis that the population medians are equal, i.e. $H_0: M_1 = M_2$ against the alternative that $M_1 < M_2$, applying the Mann-Whitney U test. (I.U., M.Sc. 1992, 95)

24.13 Use the Mann-Whitney U-test to test the hypothesis that the mean scores of students in arithmetic computation in two types of school are equal, using $\alpha = 0.05$. Given the data:

Marks           | 0-9 | 10-19 | 20-29 | 30-39 | 40-49 | ...
Residential     | 1   | 3     | 10    | 63    | 38    | 5
Non-Residential | 4   | 7     | 25    | 37    | 13    | 4

24.14 In an experiment on the effectiveness of a teaching machine, a machine-instructed group of students was compared with a ...-instructed group on an achievement test. The following were obtained:

... | 40-49 | 50-59 | 60-69 | 70-79

[The entries of this table are not legible in this copy.]

Use the Mann-Whitney U-test to determine whether there is a significant difference in achievements of the two groups. (P.U., M.Sc. 1989)
24.15 (a) Discuss fully the procedure and rationale of a two-sample median test.

(b) Using the data in question 24.12, test at $\alpha = 0.05$ the hypothesis that men and women come from populations with the same median, applying the median test.

24.16 Use the median test at the $\alpha = 0.05$ level, to test the null hypothesis that the two samples are drawn from populations with the same median.

Sample 1 | 92, 63, 30, 78, 24, 19, 26, 79, 54, 57, 97, 46, 58, 74, 77, 89, 93, 99, 78, 50.
Sample 2 | 77, 87, 98, 62, 76, 41, 66, 83, 72, 80, 53, 80, 48, 75, 76, 18, 97, 53, 64, 67.

24.17 Given the information below for four samples:

Sample 1 | 29, 40, 43, 30, 46, 34, 32, 27
Sample 2 | 35, 39, 33, 31, 40, 34, ...3, 29
Sample 3 | 34, 43, 42, 31, 35, 40, 25, 42
Sample 4 | ..., 5, 42

Use the Median test to test the hypothesis that the samples come from the identical population. Use $\alpha = 0.05$.

24.18 (a) A true-false examination was constructed with the answers running in the following sequence:

[The sequence of T and F answers is not legible in this copy.]

Does this sequence indicate a departure from randomness in the arrangement of T and F answers?

(b) Would the following series of boys and girls, picked haphazardly at a coeducational school, be acceptable as a random sample?

GgpBGGBGBBBGGGGBGBGEECEGGGRy
GbGGBGGBGBGBBG

24.19 (a) At the $\alpha = 0.05$ level, perform a runs test to determine if the following sequence of measurements is random:

3.6, 3.9, 4.1, 3.6, 3.8, 3.7, 3.4, 4.0, 3.8, 4.1, 3.9, 4.0, 3.8, 4.2, 4.1

(b) Consider the data in question 24.10 to be two random samples and use the runs test to test the hypothesis that the two samples are drawn from populations having identical distributions. Let $\alpha = 0.05$.

24.20 (a) What is the function and rationale of the Kolmogorov-Smirnov one-sample test? Also explain its technique.

(b) Test the hypothesis, by the Kolmogorov-Smirnov test, that the following sample: 0.36, 0.92, -0.56, 1.86, 1.74, 0.56, -0.95, 0.24, -0.15, -0.74, 0.32, 0.82, 0.70, -0.10, -1.06, 0.15, 0.55, -0.48, -0.49, -1.26 comes from a normal distribution with mean 0.5 and variance 1. Use $\alpha = 0.05$. (I.U., M.Sc. 1987, 92)

24.21 The following data show the kidney weights in grams of 36 dogs prior to their use in experiments.

58  78  84  90   97  70   90  86  82
59  90  70  74   83  90   76  88  84
68  93  70  94   70  110  67  68  75
80  68  82  104  92  112  84  98  80

Do the data provide enough evidence that they have come from a normally distributed population with a mean of 85 grams and a standard deviation of 15 grams? Use the Kolmogorov-Smirnov test.

24.22 (a) Apply the Kolmogorov-Smirnov test to the data of question 24.10. Use a two-sided alternative hypothesis.

(b) Apply the Kolmogorov-Smirnov test to the data of question 24.13, to test the hypothesis that there is no difference between the two types of schools.

24.23 Four groups A, B, C and D are involved in an experiment. The ... in each group are randomly selected and the following data present the results of the experiments:

Group A | ...
Group B | ...
Group C | ...
Group D | ...

Use the Kruskal-Wallis H test to decide whether the four groups are taken from populations with the same mean at $\alpha = 0.05$.

24.24 The following information reflects the results of the tensile strengths of the molded parts produced by different methods (A, B, C, D):

A | 80, 88, 87, 86, 90, 88, 85
B | 99, 91, 98, 98, 99, 96, 92, 98
C | 89, 82, 81, 80, 86, 86, 86, 84
D | 76, 71, 75, 78, 76, 73, 7..., 80, 75, 80

Use the Kruskal-Wallis test to determine whether the mean tensile strengths are the same for all the methods.


Appendix - A: Vital Statistics

A.1 Meaning of vital statistics. The size and composition of a population change on account of certain factors which are called vital events. Vital events happen to some members of the population, and they include births, deaths, etc. (which affect the size of the population), and marriages, sickness, adoptions, legitimation, etc. (which affect population composition). The collection, presentation and analysis of data on such events constitute Vital Statistics. The term Vital Statistics may also be defined as the data systematically collected and compiled in numerical form relating to or derived from records of vital events. It forms part of the whole study of man, and throws light on various ... and medical problems.

A.2 Registration of Births and Deaths in Pakistan. The registration of births and deaths, which takes place in Pakistan, is an integral part of the national ... The Provincial Governments ... the births and deaths they register through the local bodies to the ... offices. The provinces of Pakistan contain rural and urban areas, and the registration of births and deaths is carried out separately for the two types of areas.

In rural areas, the registration of births and deaths, and the maintenance of vital statistics is the responsibility of the union councils. It is the duty of the head of the household in which a birth or death has occurred, to get the event registered with the union council directly or through the ... (watchman) who gets it registered with the union council when he visits it. ... entries in birth and death ... registration offices.

In urban areas, the Municipal Corporations, Municipal Committees, Town Committees and Cantonment Boards register all births and deaths occurring within their respective limits. The event is reported for its registration by the head of the household in which a birth or death has occurred. In respect of births, the midwife/dai attending the birth is also required to report the occurrence to the registration office. The occurrence of birth or death in any hospital, private or public institution, is reported by the officer in charge of such institution.

518 


A.3 Uses of Vital Statistics. The important uses of vital statistics are enumerated below:

(i) The vital statistics exhibit the changing pattern of the population of a country and reveal the virility of the races. Birth and death rates computed from them throw light upon medical facilities, hygienic conditions, standard of living, etc.
(ii) On the basis of trends in births and deaths, many manufacturers plan their production of goods such as food, clothing, medicines and other articles needed for children and aged people.
(iii) The vital statistics are used by a number of government and private agencies to draw up their social and economic plans. For instance, planning and production of housing and educational facilities, planning and operating of social security programmes, etc. depend on vital statistics.
(iv) Vital statistics in respect of incidence of diseases, epidemics, etc. are of great use in taking measures to prevent or control the spread of diseases. Arrangements for inoculation and vaccination are made on receiving information of the outbreak of any epidemic.
(v) Birth and death certificates are needed in many circumstances.
(vi) Accurately registered and systematically collected vital statistics can be used to check up the accuracy of data provided by the census.
(vii) The vital index is studied to know whether the population of a country is increasing, decreasing or is stable.
(viii) The actuaries use vital statistics when estimating the life insurance premium to be charged.
(ix) The business prospects of jewellers, furniture manufacturers, etc. are influenced by the trend in marriages.
(x) Vital statistics serve the administrative and research needs of public health agencies.

A.4 Shortcomings of Vital Statistics. Some of the important defects from which our vital statistics suffer are stated below:

(i) There is evidence that many births escape registration, especially in rural areas. Many deaths, marriages and divorces also remain unregistered. These statistics thus suffer from a large amount of under-registration, the exact extent of which is not known.
(ii) Data on the incidence of diseases, on ages at death and on the causes of death are notoriously unreliable.
(iii) Information in respect of infectious diseases is grossly under-reported. The importance of such reporting is not realized because of illiteracy of the population.
(iv) Data on vital statistics are sometimes deliberately misreported. For example, widowed or divorced people may report themselves as single, or they may not tell their true ages.
(v) Errors arise at the stages of compiling and processing, and with respect to the presentation of vital statistics, due to lack of statistical knowledge.
(vi) Delay in compilation and tabulation is also one of the defects from which the vital statistics suffer.

A.5 Rates and Ratios. For purposes of comparison in vital statistics, the commonly used relative numbers are rates and ratios.

The ratio of one number "a" to another number "c" is defined as "a divided by c". It thus indicates the relative size of two numbers. Ordinarily, a and c represent separate and distinct categories. In vital statistics, a ratio expresses the relation of a given kind of event to the occurrence of other events, or of one kind of data to another. Thus, we may write

\[ \text{Ratio} = \frac{a}{c}, \]

where a denotes the number of times the given kind of event occurs, and
c denotes the number of times another event occurs.

The types of ratios used in vital statistics are the sex ratio, child-women ratio, birth-death ratio or vital index, etc.

A type of ratio in which the denominator represents the total number of cases and the numerator represents a certain fraction of this number is called a proportion.

The rate is a type of ratio, which in vital statistics may be defined as the proportion of the number of vital events to the population in which the events took place. In other words,

\[ \text{Rate} = \frac{a}{a + b}, \]

where a stands for the number of times the given vital event occurs, and
b denotes the number of times the event does not occur.

Rates are usually multiplied by 1,000 for ease in understanding. They are generally of two types, namely, the crude rates and the specific rates. A crude rate is defined as a ratio of events occurring during a year to the midyear population during the same year, without regard to any specific characteristics of the population. On the other hand, a specific rate is defined in terms of one or more characteristics of the population.
A.6 Sex Ratio. The ratio between males and females in a human population is called a sex ratio. It is computed by dividing the number of males in a population by the number of females in the same population, and the result is expressed in percentage. In other words,

\[ \text{Sex Ratio} = \frac{\text{Number of Males}}{\text{Number of Females}} \times 100 \]

Thus the sex ratio indicates the number of males per 100 females. It is of great importance to note that a sex ratio more than 100 would imply a preponderance of males, i.e. it would mean that there are more men than women. A sex ratio less than 100 signifies that there are more women than men. Sex ratios for some specific age-groups may also be computed separately.
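The sex ratio formula can be sketched in a few lines of Python. This is only an illustration: the function name `sex_ratio` is an assumption introduced here, and the figures of 3,300 thousand males and 3,157 thousand females are used purely as sample inputs.

```python
def sex_ratio(males: float, females: float) -> float:
    """Return the number of males per 100 females."""
    if females <= 0:
        raise ValueError("number of females must be positive")
    return males / females * 100


# Sample inputs (in thousands) for a single age-group.
ratio = sex_ratio(3300, 3157)
print(round(ratio, 1))  # a value above 100 means males outnumber females
```

Applying the same function to each age-group in turn gives the age-wise sex ratios.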


Example A.1 Calculate Sex Ratios by Age for the 1961 census population of West Pakistan and examine your results critically.
(P.U., B.A./B.Sc. 1970)


Sex: Ratios 


- Males 
C000) 


Ages 
(years) 


(C900) 


——— a re 


| Females 
\ 


3,300 3,157 
3,456 3,016 
4,014 3,328 | 
. 20-29 3,221 2904 -| 
30 -39 2,456 2,161 
. 40-49 1,882 1,541 
50-59 12725 | 996 
60 & ‘ver | 


rn a 1,563 AAT 
tos ta T od 
pe lages | 21,168 | 18,274 
u ee eham 5 F 
Sex ratios (number of males per 100 females) at different ages for the above table are computed, and are given in the fourth column of the table. These sex ratios reveal fluctuations. The sex ratio for the youngest age-group is 104.5, which signifies that the male births are slightly more frequent than the female births. The boys between the ages 10-19 are more numerous than the girls in the same age group. The sex ratios for the higher ages show an unusual preponderance of men, owing to the higher death rate of females at higher ages.


A.7 Child-Women Ratio. The ratio between children under 5 years of age and the women of "child-bearing age" is called a child-women ratio. The child-bearing age is usually represented by age-group 15-44 and sometimes by age-group 15-49. The child-women ratio is computed by the formula

\[ \text{Child-Women Ratio} = \frac{P_{0-4}}{f_{15-44}} \times 1{,}000, \]

where $P_{0-4}$ denotes the number of children, both sexes combined, under 5 years of age, and
$f_{15-44}$ denotes the number of females (women) between ages 15-44 ($f_{15-49}$ is used when child-bearing age is taken as 15-49).
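A minimal Python sketch of the child-women ratio follows. The function name and the input figures are hypothetical and serve only to show the arithmetic.

```python
def child_women_ratio(children_under_5: float, women_child_bearing: float) -> float:
    """Children aged 0-4 (both sexes) per 1,000 women of child-bearing age."""
    if women_child_bearing <= 0:
        raise ValueError("number of women must be positive")
    return children_under_5 / women_child_bearing * 1000


# Hypothetical figures: 6,400 children under 5 and 8,000 women aged 15-44.
print(child_women_ratio(6400, 8000))
```

The same function applies when the child-bearing age is taken as 15-49; only the denominator passed in changes.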

A.8 Birth-Death Ratio or Vital Index. The ratio between the total number of births and the total number of deaths of a population during a particular year is called the birth-death ratio or vital index. It is computed by the formula

\[ \text{Vital Index} = \frac{\text{Total Number of Births}}{\text{Total Number of Deaths}} \times 100 \]

A vital index more than 100 indicates that the population is increasing and is in a healthy condition. When it is less than 100, it implies that the population is decreasing, and the population is stable when the vital index equals 100.
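The interpretation of the vital index can be sketched as follows. The function names and the birth and death counts are hypothetical; the three-way reading (increasing, decreasing, stable) follows the rule stated above.

```python
def vital_index(births: float, deaths: float) -> float:
    """Births per 100 deaths in the same population and year."""
    if deaths <= 0:
        raise ValueError("number of deaths must be positive")
    return births / deaths * 100


def population_trend(index: float) -> str:
    """Read the vital index as described in the text."""
    if index > 100:
        return "increasing"
    if index < 100:
        return "decreasing"
    return "stable"


# Hypothetical year: 45,000 births and 30,000 deaths.
index = vital_index(45000, 30000)
print(round(index, 1), population_trend(index))
```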

A.9 Population Growth Rate. The annual population growth rate is computed by dividing the increase in population during a year by the population at the beginning of that year, when the population of a country is available each year. But in many countries, including Pakistan, the total population is not available for every year; the population figures are obtained by the census. In such cases, the annual population growth rate is computed by the formula (compound interest formula)

\[ P_n = P_0 (1 + r)^n \]

where $P_0$ denotes the population at the start of the period,
$P_n$ denotes the population after n years,
n denotes the intercensal period in years, and
r denotes the unknown annual rate of change, expressed in the form of a percentage.

For example, the population of Pakistan according to the 1972 census was 65,309 thousand and according to the 1981 census was 84,254 thousand. For computation of the growth rate the intercensal period was taken as 8.46 years, as the reference date for the 1972 census was 16th September, 1972 and for the 1981 census it was 1st March 1981.

(Source: Pakistan Census Organization, Statistics Division, Government of Pakistan, Islamabad: "1951-81 Population of Administrative Units; as on 4th February 1986").

Substituting the values in the formula, we get

\[ 84{,}254 = 65{,}309\,(1 + r)^{8.46} \]

\[ (1 + r)^{8.46} = \frac{84{,}254}{65{,}309} = 1.2901 \]

Taking logs, we get

\[ 8.46 \log (1 + r) = \log (1.2901) \]

\[ \log (1 + r) = \frac{1}{8.46}\,[0.11062] = 0.01308 \]

or

\[ 1 + r = \text{Antilog}\,(0.01308) = 1.0306 \]

\[ r = 1.0306 - 1 = 0.0306 \]

Thus the average growth rate of the population of Pakistan during the intercensal period 1972-1981 was about 3.06 per cent per annum.

The formula given above is sometimes written as

\[ r = \left( \frac{P_n}{P_0} \right)^{1/n} - 1 \]

and is known as the Geometric method for estimating the annual growth rate.
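The Geometric method can be checked numerically without log tables. The sketch below is an illustration only; the function name `geometric_growth_rate` is an assumption, and the inputs are the two census totals (in thousands) and the intercensal period quoted in the example above.

```python
def geometric_growth_rate(p0: float, pn: float, n: float) -> float:
    """Annual growth rate r solving pn = p0 * (1 + r) ** n."""
    if p0 <= 0 or pn <= 0 or n <= 0:
        raise ValueError("populations and period must be positive")
    return (pn / p0) ** (1 / n) - 1


# Census totals in thousands and an intercensal period of 8.46 years.
r = geometric_growth_rate(65309, 84254, 8.46)
print(f"{r * 100:.2f} per cent per annum")
```

Direct computation gives a rate of roughly 0.0306; small differences from a hand calculation come from rounding in the logarithm tables.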
A.10 Classification of Vital Rates. The commonly employed rates in vital statistics may be classified as follows:

(a) Death Rates or Mortality Rates. The kinds of death rates are:
(i) Crude Death Rate, (ii) Specific Death Rate,
(iii) Infant Mortality Rate, (iv) Case Fatality Rate,
(v) Standardized Death Rate.

(b) Birth Rates or Natality Rates. The commonly used birth rates are:
(i) Crude Birth Rate, (ii) Specific Birth Rates,
(iii) Standardized Birth Rate.

(c) Reproduction Rates. There are two important types of such rates, namely,
(i) Gross Reproduction Rate, (ii) Net Reproduction Rate.

(d) Morbidity Rates or Sickness Rates.

(e) Marriage Rates.

(f) Divorce Rates, etc.


A.11 Crude Death Rate. For a given area, the crude death rate may be defined as a ratio of total registered deaths of a specified year to the total midyear population in the same year, multiplied by 1,000. It is computed as follows:

\[ \text{C.D.R.} = \frac{D}{P} \times 1{,}000, \]

where C.D.R. stands for crude death rate,
D denotes the total number of deaths from all causes during a calendar year, and
P denotes the midyear total population during the same year (which is taken as an estimate of the average population during the whole calendar year).

The crude death rate is perhaps the most widely used vital rate, as it is easily understood and quickly computed. It merely tells that so many persons have died during a year, that is, it represents the probability of dying for persons in the population. Mortality, as is well known, varies with age, sex, race and occupation, but the crude death rate ignores all these factors; it is therefore likely to be misleading and hence should not be used for comparison between areas. The crude death rate measures the decrease in the population due to deaths and thus has a value as an index of mortality.
A.12 Specific Death Rates. When death rates are computed for some specific class of people or specific age-group, they are called specific death rates, the kind of specificity being stated. The most important and widely applicable specific death rates are age-specific death rates and age-sex-specific death rates.

Age-specific death rates are computed by the formula

\[ \text{A.S.D.R.} = \frac{d_i}{P_i} \times 1{,}000, \]

where A.S.D.R. denotes the age-specific death rate,
$d_i$ denotes the number of deaths occurring in the ith (specified) age-group during a given year, and
$P_i$ denotes the midyear population of the same age-group.

The age-specific death rate measures the risk of dying in each of the age-groups selected for the computation. When age-specific death rates are computed separately for males and females, they are called age-sex-specific death rates. They are used in the computation of reproduction rates and life tables.
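The crude and age-specific death rates differ only in what is counted in the numerator and denominator, which the following sketch makes explicit. The function names are assumptions, and the deaths and midyear populations for the two age-groups are illustrative figures.

```python
def crude_death_rate(deaths: float, midyear_population: float) -> float:
    """Deaths from all causes per 1,000 of the midyear population."""
    return deaths / midyear_population * 1000


def age_specific_death_rates(deaths_by_age: dict, population_by_age: dict) -> dict:
    """Deaths per 1,000 midyear population, computed age-group by age-group."""
    return {
        age: deaths_by_age[age] / population_by_age[age] * 1000
        for age in deaths_by_age
    }


# Illustrative figures for two age-groups of one sex.
deaths = {"10-14": 1043, "15-19": 931}
population = {"10-14": 90229, "15-19": 83442}

for age, rate in age_specific_death_rates(deaths, population).items():
    print(age, round(rate, 2))

# The crude rate pools all age-groups and so hides their differences.
print(round(crude_death_rate(sum(deaths.values()), sum(population.values())), 2))
```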


Example A.2 Calculate age-sex specific death rates for the following data for selected age-groups:

Age-group (years) | Midyear population | No. of Deaths

90,229 | 69,379


959 | 


83,442 56,371 


`| 164,377 


103,014 


111,878 | 68,724 


69,008 48,777 


39,782 | 29,420 


Solution. Calculation of age-sex specific death rates for selected age-groups:

Males:

The age-specific death rate for the age-group 10-14 is

\[ \text{A.S.D.R.} = \frac{\text{Number of deaths occurring in age-group 10-14}}{\text{Midyear population in age-group 10-14}} \times 1000 = \frac{1043}{90229} \times 1000 = 11.56 \]

The age-specific death rate for the age-group 15-19 is

\[ \text{A.S.D.R.} = \frac{\text{Number of deaths occurring in age-group 15-19}}{\text{Midyear population in age-group 15-19}} \times 1000 = \frac{931}{83442} \times 1000 = 11.16 \]

Similarly, we compute the age-specific death rates for the remaining age-groups of the male population and for all age-groups of the female population. The age-specific death rates calculated for the data are shown in the table that follows:


| i tiv) sometimes infant deaths are not 


Age-group (years) | Age-specific death rates per 1000 persons per year


. 26.31 

A.13 Infant Mortality Rate. It is defined as a ratio of registered deaths of infants during a specified year to the total live births registered in the same year. The formula thus becomes

\[ \text{I.M.R.} = \frac{d_0}{B} \times 1{,}000, \]

where I.M.R. stands for infant mortality rate,
$d_0$ denotes the number of deaths (excluding foetal deaths) under one year of age registered during a given year in a locality, and
B denotes the number of live births (a child showing any evidence of life is registered as a "live birth") registered during the same year in the same locality.

The infant mortality rate does not provide an accurate measure of the risk of death during the first year of life because

(i) infants are usually under-enumerated,
(ii) the babies who die immediately after birth are often not registered as live births,
(iii) some of the deaths under one year of age during a calendar year must have been of infants who had been born in the preceding calendar year, and
(iv) sometimes infant deaths are not separated from stillbirths and abortions.

The infant mortality rate is however frequently used for many purposes. A low infant mortality rate signifies that the maternity cases are well attended, medical care facilities are available, hygienic conditions are good, etc. Thus it serves as an indicator of the level of "healthiness" of a society.
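As a small illustration of the infant mortality rate, the following Python sketch uses hypothetical registration figures; the function name is likewise an assumption.

```python
def infant_mortality_rate(infant_deaths: float, live_births: float) -> float:
    """Registered deaths under one year of age per 1,000 registered live births."""
    if live_births <= 0:
        raise ValueError("number of live births must be positive")
    return infant_deaths / live_births * 1000


# Hypothetical locality: 120 infant deaths and 1,500 live births in a year.
print(round(infant_mortality_rate(120, 1500), 1))
```

Note that the denominator is live births in the same year, not the midyear population, which is why limitation (iii) above arises.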


A.14 Standardized Death Rates. The crude death rates of two areas or two occupations cannot be compared because mortality varies with age, sex, climate, occupation, etc. Aged people, even if well fed, die at a comparatively higher rate than the young people. Good climate is conducive to long life whereas a bad climate is injurious to health. Moreover, mortality is highest at the two ends of the age scale. To eliminate such spurious effects, we compute what are called the corrected or standardized death rates. Standardized death rates can be computed either by

(i) direct method, i.e. by applying different age-specific death rates of the population being studied (or given population) to a standard population (which may be any population selected to be the basis of the comparison), or
(ii) indirect method, i.e. by applying different age-specific death rates of a standard population to the population being studied.

Direct Method. The direct method for computing the standardized death rate consists in calculating the number of deaths that would be expected to occur in a standard population if the age-specific death rates of the given population were to apply, and dividing the sum of the expected deaths by the standard population. That is, the age-adjusted or standardized death rate is computed by the formula

\[ \text{S.D.R.} = \frac{\text{Expected deaths in standard population}}{\text{Total standard population}} \times 1{,}000 = \frac{\sum \dfrac{d_i}{p_i} P_i}{\sum P_i} \times 1{,}000, \]

where S.D.R. stands for standardized death rate,
$d_i$ denotes the number of deaths occurring in the given population in the ith age-group during a calendar year,
$p_i$ denotes the midyear population figure in the given population in the ith age-group during the same year, and
$P_i$ denotes the midyear population figure in the standard population in the ith age-group in the same year.

If the death rates are adjusted both for age and sex, the expected deaths are calculated as follows:

\[ \text{Expected deaths for males (m)} = \frac{d_{im}}{p_{im}} \times P_{im}, \]

\[ \text{Expected deaths for females (f)} = \frac{d_{if}}{p_{if}} \times P_{if}. \]

The standardized death rate or the age-sex adjusted death rate is given by the formula

\[ \text{S.D.R.} = \frac{\sum \dfrac{d_{im}}{p_{im}} P_{im} + \sum \dfrac{d_{if}}{p_{if}} P_{if}}{\sum P_{im} + \sum P_{if}} \times 1{,}000. \]

This sort of standardized death rate is easy to compute and to explain. The choice of standard population, being subjective, may influence the comparison of standardized death rates.


Indirect Method. When the age-specific death rates for the given population are not available or there are fluctuations in age-specific death rates, we determine the number of deaths that would be expected in the given population, assuming that the age-specific death rates of the standard population had prevailed for one year in the given population. The ratio of total deaths observed in the given population to expected deaths, multiplied by the crude death rate of the standard population, yields the age-adjusted or standardized death rate by the indirect method. In other words,

\[ \text{S.D.R.} = \frac{\text{Observed deaths in the given population}}{\text{Expected deaths in the given population}} \times \text{C.D.R. of standard population}. \]

The number of expected deaths for the ith age-group is computed by the formula

\[ \text{Expected deaths} = \frac{D_i}{P_i} \times p_i, \]

where $D_i$ denotes the number of deaths in the ith age-group of the standard population during a year,
$P_i$ denotes the midyear population figure in the ith age-group of the standard population during the same year, and
$p_i$ denotes the midyear population in the ith age-group of the given population.

If adjustment is also to be made with respect to sex, the expected deaths are estimated separately for males and females and the formula is applied accordingly.


With adjustment for both age and sex, the standardized death rate by the indirect method is

\[ \text{S.D.R.} = \frac{\sum d_{im} + \sum d_{if}}{\sum \dfrac{D_{im}}{P_{im}} p_{im} + \sum \dfrac{D_{if}}{P_{if}} p_{if}} \times \text{C.D.R. of standard population}, \]

where

\[ \text{C.D.R. (Standard Population)} = \frac{\sum D_{im} + \sum D_{if}}{\sum P_{im} + \sum P_{if}} \times 1{,}000. \]

These methods can also be applied for computing standardized unemployment rates when the number of persons and the number of those seeking employment are available according to suitable age-groups.

Example A.3. Given the following data for two districts:

[Table: population and deaths by age-group (years) for Districts A and B; the legible age-groups are 0-14, 15-29, 30-44, 45-59 and 60-74]

Compute Crude Death Rates for Districts A and B and point out the fallacies. Considering District A as the standard population, calculate the standardized death rate for District B, using the direct method.
(P.U., B.A./B.Sc. 1967)

Solution: The computations for crude death rates and age-specific death rates in both the districts, to point out the fallacies, if any, are given below:

[Table: Age-group (years); Population, Deaths and A.S.D.R. for each district]

\[ \text{Crude death rate for District A} = \frac{67}{6{,}100} \times 1{,}000 = 11.0 \]

\[ \text{Crude death rate for District B} = \frac{145}{6{,}100} \times 1{,}000 = 23.8 \]

To compute the standardized death rate for District B by the direct method, we first calculate the expected deaths in District A (the standard population) as experienced in District B by the formula

\[ \text{Expected deaths} = \frac{d_i}{p_i} \times P_i \]

These deaths are given below:

[Table: Age-group (years); Expected deaths in the standard population]

\[ \text{Hence standardized death rate} = \frac{57}{6{,}100} \times 1{,}000 = 9.3 \]

The crude death rate for District B, which contains a very large proportion of people over the age of 45 years, comes to 23.8, which is unusually high, whereas the crude death rate for District A, which contains a smaller proportion of people over the age of 45 years, is 11.0. The age-specific death rates at all age-groups in District A are higher than those in District B, but the crude death rate of B is more than double the crude death rate of A. The fallacy of this sort of unusual behaviour lies in the fact that like is not being compared with like. In such a case, standardized death rates are used. Standardizing has reduced the death rate for District B, which turns out to be 9.3.

Example A.4 Calculate the crude and standardized death rates from the following data.

[Table: population and deaths by age (years: 0-9, 10-24, 25-44, 45-64, 65 & over) for District A and for the Standard population]

Solution.

\[ \text{Crude death rate for District A} = \frac{2{,}967}{109{,}043} \times 1{,}000 = 27.21 \]

\[ \text{Crude death rate for Standard} = \frac{29{,}035}{1{,}023{,}369} \times 1{,}000 = 28.37 \]

The computations for the standardized death rates both by (i) Direct method, i.e. by the formula

\[ \text{expected deaths} = \frac{d_i}{p_i} \times P_i \]

and (ii) Indirect method, i.e. by the formula

\[ \text{expected deaths} = \frac{D_i}{P_i} \times p_i \]

are shown below:

[Table: Age (years); (i) Expected deaths in Standard Population; (ii) Expected deaths in District A]

\[ \text{Standardized death rate (Direct)} = \frac{28{,}538}{1{,}023{,}369} \times 1{,}000 = 27.89 \]

\[ \text{Standardized death rate (Indirect)} = \frac{2{,}967}{2{,}973} \times 28.37 = 28.31 \]

A.15 Crude Birth Rate. It is a ratio of total registered live births during a calendar year to the total midyear population during the same year, multiplied by 1,000. This is analogous to the crude death rate. It is computed by the formula

\[ \text{C.B.R.} = \frac{B}{P} \times 1{,}000, \]

where C.B.R. stands for crude birth rate,
B denotes the total number of live births registered during a given year, and
P denotes the midyear total population during the same year.

It is a rough measure of the reproduction capacity of the population. The fertility level of one area cannot be compared with that of another area by crude birth rates.

The difference between the crude birth rate and the crude death rate gives the crude rate of natural increase of the population per year per thousand people. It is computed as

\[ \text{Crude rate of natural increase} = \frac{\text{Births} - \text{Deaths}}{\text{Midyear Population}} \times 1{,}000 \]

If births are more than deaths, additions are being made to the population, but if deaths are more, persons are being removed from the population.

The crude birth rate is sometimes called the "crude live-birth rate" to distinguish it from the "still-birth rate", which is computed as follows:

\[ \text{Still-birth Rate} = \frac{B_s}{B} \times 1{,}000, \]

where $B_s$ denotes the number of still-births during any year, and
B denotes the total number of births (both live and still) in the same year.

It provides a measure of reproductive wastage.

Note. A still birth may be defined as the birth of a foetus showing no evidence of life after complete expulsion or extraction, once a specified period of gestation has been reached.

A.16. Age-Specific Birth Rate. While computing the crude birth rate, we have taken into consideration the total population, but the number of births depends upon the number of women of child-bearing age. Moreover, births vary with the age of the parents. Age-specific birth rates (A.S.B.R.) are therefore generally computed for women by the formula

\[ \text{A.S.B.R.} = \frac{b_i}{P_{if}} \times 1{,}000, \]

where A.S.B.R. stands for age-specific birth rate,
$b_i$ denotes the number of births registered during the year to women of the ith age-group, and
$P_{if}$ denotes the midyear population figure of women in the same age-group.

Age-specific birth rates are used to compare the natality of different areas, as their computation takes account of the age-sex composition of the population.

A.17. Standardized Birth Rate. The crude birth rates cannot be used for inter-area comparison because the number of births in any area depends upon the number of married women between the child-bearing ages. The crude birth rate for an area containing a very large proportion of population outside the child-bearing ages is likely to be smaller in spite of the fact that the number of children per woman in the area is greater. This sort of distortion is reduced by standardizing the birth rates. The standardized birth rate is computed in a manner similar to the standardized death rate by taking a standard population of women between the child-bearing ages given according to different age-groups.

A.18. General Fertility Rate. Fertility means actual production of children. It should be distinguished from fecundity, which means the physiological ability to produce children, irrespective of whether or not children have been produced. It is to be noted that the opposite of sterility is fecundity and not fertility. Fertility is measured from birth statistics, but there is no direct measurement for fecundity.

The general fertility rate is a ratio of all live births registered during a year to the number of women of child-bearing age. It is computed by the formula

\[ \text{G.F.R.} = \frac{B}{P_f} \times 1{,}000, \]

where G.F.R. stands for general fertility rate,
B denotes the total number of live births registered during the year, and
$P_f$ denotes the midyear population of women of child-bearing age.

This fertility rate is general in the sense that it attributes all births to all women in child-bearing age-groups. But the number of births depends upon the number of married women of child-bearing age. Moreover, fertility varies with a number of factors such as age, duration of marriage, occupation, social class, religion, area of residence, etc. The fertility rate specific for age is defined and computed by the following formula:

\[ \text{Age-Specific Fertility Rate} = \frac{B_i}{P_{if}} \times 1{,}000, \]

where $B_i$ denotes the number of live births occurring to mothers of the ith age-group during a year, and
$P_{if}$ denotes the midyear female population of the same age-group during the same year.

The terms age-specific birth rates and age-specific fertility rates are used interchangeably.

A.19. Total Fertility Rate. The total fertility rate is obtained by aggregating the age-specific fertility rates for women of each reproductive age. When the age-specific fertility rates are given or computed for 5-year age-groups, the aggregate is to be multiplied by 5, the number of years in each age-group, as it is the sum of the rates at every individual age which is required. In other words, the total fertility rate is computed by the formula

\[ \text{T.F.R.} = 5 \sum (\text{age-specific fertility rate}) = 5 \sum_i \frac{B_i}{P_{if}} \times 1000, \]

where T.F.R. stands for total fertility rate,
$B_i$ denotes the total live births in the ith age-group, and
$P_{if}$ denotes the midyear female population in the same age-group during the same year.

The total fertility rate thus provides the total number of babies who would be born (ignoring mortality) to a hypothetical group of 1,000 women subject to these fertility rates while passing through the reproductive period.

A.20 Reproduction Rates. While deriving the general fertility rates, the sex of the child born and the mortality are not taken into consideration. Taking sex into account, we get the Gross Reproduction Rate (G.R.R.). We obtain the Net Reproduction Rate (N.R.R.) when adjustment in respect of both sex and mortality is made.

The gross reproduction rate is thus defined as the sum of age-specific birth rates of women of child-bearing age, taking female births only. It is assumed that

(i) none of the female babies dies before reaching the end of child-bearing age, and
(ii) the fertility rate would remain unchanged throughout the child-bearing age.

When age-specific fertility rates are given or computed for age-groups, each such rate should be multiplied by the number of years in each age-group and then added, or the sum of the rates may be multiplied by the span of the age-groups instead of multiplying each rate separately. The classification of female population of child-bearing age, taken as 15-49 years, into 5-year age-groups such as 15-19, 20-24, ...., 45-49, yields seven age-groups. The G.R.R. is therefore computed by the formula

\[ \text{G.R.R.} = 5 \sum_{i=1}^{7} \frac{b_{if}}{P_{if}}, \]

where $b_{if}$ denotes the number of live female births registered during a year to mothers of age i (i is an age-group of five years), and
$P_{if}$ denotes the midyear female population of the same age-group.

Sometimes the births of both sexes may be given and the computations are made by using these births. The G.R.R. in such a case is obtained by multiplying the resulting sum by the ratio of female births to total births.

The gross reproduction rate estimates the average number of female babies produced by one married woman throughout her reproductive life.

A.21 Net Reproduction Rate. It is a well known fact that not all the female babies born survive till they reach the child-bearing age. To adjust the rate for mortality, we therefore take into account the survival rates (i.e. the probability of daughters surviving from birth to the age-group of the mother) for girls born to women in different age-groups. The N.R.R. is thus defined as the average number of female babies who would become mothers when they attain their child-bearing age. It is computed by the formula

\[ \text{N.R.R.} = 5 \sum \frac{\text{Female births}}{\text{Female population}} \times \text{probability of survival} \]

Hence the N.R.R. estimates the number of female babies that would be produced by women throughout their life time if they were to experience at each age-group certain fertility and mortality rates. Thus it estimates the average number of potential mothers who survive.

A net reproduction rate of 1 signifies that the present female population is exactly maintaining itself and the population is considered stable. If it is less than 1, it implies that the number of potential mothers is decreasing and hence the population is declining. The population in this case is considered as heading towards extinction. If it is greater than 1, the population will increase.
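The total fertility, gross reproduction and net reproduction rates share one pattern: sum the age-specific rates and multiply by the width of the age-groups. The sketch below illustrates this under stated assumptions: the function names and sample figures are hypothetical, all lists are ordered by the same age-groups, and the helper that splits total births by sex assumes a single sex ratio at birth for all ages.

```python
def total_fertility_rate(births, women, width=5):
    """Births per 1,000 women over the whole reproductive span."""
    return width * sum(b / w * 1000 for b, w in zip(births, women))


def gross_reproduction_rate(female_births, women, width=5):
    """Daughters per woman, ignoring mortality."""
    return width * sum(fb / w for fb, w in zip(female_births, women))


def net_reproduction_rate(female_births, women, survival, width=5):
    """Daughters per woman who survive to the age-group of their mother."""
    return width * sum(fb / w * s for fb, w, s in zip(female_births, women, survival))


def female_births_from_total(births, sex_ratio_at_birth):
    """Split total births using a sex ratio given as males per 100 females."""
    share = 100 / (100 + sex_ratio_at_birth)
    return [b * share for b in births]


# Hypothetical data for two 5-year age-groups.
births, women, survival = [100, 200], [1000, 1000], [0.9, 0.8]
daughters = female_births_from_total(births, 105.2)

print(round(total_fertility_rate(births, women)))
print(round(gross_reproduction_rate(daughters, women), 3))
print(round(net_reproduction_rate(daughters, women, survival), 3))
```

Because each survival probability is at most 1, the net rate can never exceed the gross rate computed from the same data.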


Example A.5. Calculate age-specific fertility rates, total fertility rate, gross reproduction rate and net reproduction rate for the following data. Assume sex ratio at birth to be 105.2 per cent.

[Table: Age-group (years); Population; Registered Births; Probability of Survival]

Solution. A sex ratio = 105.2 per cent implies that there are 1052 males for 1000 females. We first calculate the number of female births by multiplying the registered births of each age-group by the factor 1000/2052. The female births are shown in column 4 of the table given on the next page.

Next, we calculate the age-specific birth (fertility) rates for 1000 women by the formula

\[ \text{Age-specific fertility rate} = \frac{\text{Number of live births in ith age-group}}{\text{Midyear female population in ith age-group}} \times 1000 \]

For example, the age-specific fertility (birth) rate for age-group 15-19 is

\[ \text{A.S.B.R.} = \frac{B_i}{P_{if}} \times 1000 = \frac{1223}{13472} \times 1000 = 90.78. \]

Similarly, the rate for the next age-group comes to 197.91, and so on.

These results are set out in column 5 of the table on the next page. The sum multiplied by 5 yields the total fertility rate.

Next, we compute the age-specific birth rates for daughters by dividing the female births in the ith age-group by the female population of the same age-group, which gives rates per woman. They appear in the table on the next page, and their sum multiplied by 5 gives the G.R.R. Each age-specific birth rate for daughters is then multiplied by the probability of survival to find the expected survivors of female births. These figures are also given in the table, and their sum multiplied by 5 gives the N.R.R.

[Table: Age-group (years); Population; Registered Births; Female Babies; Age-specific fertility rates per 1000 women; Age-specific fertility rates for daughters; Probability of survival; Expected survivors. The legible entries for daughters are 0.0442, 0.0965, 0.1537, 0.0680 and 0.0359]

Hence

\[ \text{T.F.R.} = 5 \sum (\text{age-specific birth rates}) = 4{,}112 \text{ per 1000 women,} \]

\[ \text{G.R.R.} = 5 \sum (\text{age-specific fertility rate for daughters only}) = 2.004 \text{ per woman, and} \]

\[ \text{N.R.R.} = 5 \sum (\text{expected survivors of female births}) = 1.93 \text{ per woman.} \]

Example A.6. Compute the gross and net reproduction rates for the following data:

[Table: Age-group (years); Female Population (000); Female births; Probability of Survival]

Solution. The necessary computations are given below (rows in order of age-group):

Mean fertility (Female births / Female population) | Female offspring of survivors
18,900/1,558,000 = 0.012 | 0.012 x 0.914 = 0.011
0.064 | 0.057
0.061 | 0.054
0.039 | 0.034
0.021 | 0.018
0.007 | 0.006
0.001 | 0.001
Total 0.205 | 0.181

\[ \text{G.R.R.} = 5 \times \sum \frac{\text{Female births}}{\text{Female population}} = 5 \times 0.205 = 1.025 \]

\[ \text{N.R.R.} = 5 \times 0.181 = 0.905. \]

The net reproduction rate turns out to be less than 1, signifying that each mother produces less than one future mother. The population is thus considered as heading towards extinction.




Age-specific          Probability
fertility rates       of survival
for daughters

0.0442                0.9694
0.0965                0.9668
0.1537                0.9632
0.0680                0.9584
0.0359                0.9519

Multiplying by 5 for single age figures.

Hence T.F.R. = 5 Σ (age-specific birth rates) = 4,112 per 1000 women,
      G.R.R. = 5 Σ (age-specific fertility rate for daughters only)
             = 2.004 per woman, and
      N.R.R. = 5 Σ (expected survivors of female births)
             = 1.93 per woman.
Example A.6. Compute the gross and net reproduction rates for
the following data:

Age-group    Female        Female    Probability
(years)      Population    births    of Survival
             (000)

Solution. The necessary computations are given below:

Mean fertility                 Female offspring
                               of survivors

18,900/1,558,000 = 0.012       0.012 × 0.914 = 0.011
0.064                          0.057
0.061                          0.054
0.039                          0.034
0.021                          0.018
0.007                          0.006
0.001                          0.001

G.R.R. = 5 × Σ (Female births / Female Population) = 5 × 0.205 = 1.025
N.R.R. = 5 × 0.181 = 0.905.

The net reproduction rate turns out to be less than 1, signifying that
each mother produces less than one mother for the next generation. The
population is thus considered as heading towards extinction.
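The arithmetic of Example A.6 may be verified with the short Python sketch below. The two lists are the columns of the computation table as read from the printed solution; the variable names are merely illustrative.

```python
# Columns of the computation table in Example A.6.
mean_fertility = [0.012, 0.064, 0.061, 0.039, 0.021, 0.007, 0.001]
offspring_of_survivors = [0.011, 0.057, 0.054, 0.034, 0.018, 0.006, 0.001]
WIDTH = 5  # five-year age-groups

first_rate = 18_900 / 1_558_000          # mean fertility of the first group
grr = WIDTH * sum(mean_fertility)        # gross reproduction rate
nrr = WIDTH * sum(offspring_of_survivors)  # net reproduction rate

print(round(first_rate, 3), round(grr, 3), round(nrr, 3))  # 0.012 1.025 0.905
```

Since the net reproduction rate is below 1, the sketch agrees with the conclusion drawn in the text.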


EXERCISES

A.1  Describe the meaning of vital statistics and discuss its
     shortcomings in detail.                      (P.U., B.A./B.Sc.)

A.2  Define vital statistics and discuss its scope and limitations.
                                         (P.U., B.A./B.Sc. 1969, 82)

A.3  What are vital statistics? Describe the system for collection of
     vital statistics in Pakistan. Discuss its strong and weak points
     and suggest remedies.                       (P.U., M.A., 1963)

A.4  Define and explain the following:
     (i) Crude Death Rate,         (ii) Crude Birth Rate,
     (iii) Sex Ratio,              (iv) Age-specific Death Rate,
     (v) Infant Mortality Rate,    (vi) Vital Index.
                                         (P.U., B.A./B.Sc. 1971)

A.5  Define Death Rate, Birth Rate and Morbidity Rate. Justify your
     definitions by taking an example in each case.
                                         (P.U., B.A./B.Sc. 1969)

A.6  Differentiate between Rates and Ratios and explain what you
     mean by the Crude Death Rates and Age-Specific Death Rates.
                                         (P.U., B.A./B.Sc. 1970)

A.7  (a) Explain (i) vital events, (ii) sources of vital data, and
         (iii) vital index.
     (b) Calculate Age-specific Death Rates per 1,000 persons of
         Pakistan from the following data:

Age              1964 estimated       1964 estimated
(in completed    Population, Both     Deaths in 100
years)           sexes in 1000

All ages
Under 1
1-4
5-9
10-19
20-29
30-39
40-49
50-59
60 & over

                                         (P.U., B.A./B.Sc. 1970)


A.8  Calculate the age-sex specific death rates from the following data
     for selected age-groups:

Age-group    Midyear Population    Number of Deaths

             37,283                600
                                   615
10-14
15-19        28,688                704
20-29        57,626                104
30-39        39,612                771
40-49        28,708                772
50-59        18,395                806
60 & over    15,244                830

A.9  (a) Define Rates and Ratios and explain the difference between
         them with examples.
     (b) Compute specific mortality rates from cancer per 100,000
         population from the following data:

Age in years    Population as of    Deaths from
                census date         Cancer

20-29           725,369             79
30-39           700,213             238
40-49           609,616             660
50-59           548,400             1,582
60-69           402,054             2,342
70 & over       287,280             3,402

                                         (P.U., B.A./B.Sc. 1968)


A.10 Point out the difference between crude and standardized death
     rates, and explain the direct and indirect method of
     standardizing. Bring out clearly the uses of standardized rates.

A.11 (a) Why are death rates standardized?
     (b) Calculate the Crude and Standardized Death Rates of 1964
         population by (i) Direct and (ii) Indirect methods, using 19..
         population as standard.

Age-group    Population    Deaths    Population    Specific
(years)                                            death rates

0-9          10,000        220       15,000
10-19                      132
20-49                      105
50 & over                  90

                                         (P.U., B.A./B.Sc. 1983)




int out the differenc 
A12 Poin erence between Crude and 
Standardi 
Rates. Calculate the crude and standardized sian Death 
Jocal population from the following data, using a iar of the 
y rect method. 


Standard No. of 
Population deathsin | p i No. of deaths in 
Standard opulation | Local Population 
Population j 


0-9 
10- 19 
20 ~ 59 

60 & over 


(P.U., B.A/B.Sc. 1974) 

A.13 (a) Explain with suitable illustrations the object of standardizing
         various vital statistics relating to births, deaths and
         marriages.
     (b) Calculate the crude death rate and the standardized death
         rate for the data:


District A Standard 
Age Population No. of deaths Population (’000) 
females 


years) [ males | females | 
0-4 2,110 
5-14 
15 — 34 


35 — 59 
60 & over 


A.14 The following table contains the numbers and age-distribution of
     male employees at two factories, and the number of accidents:

Age        employees    accidents
(years)


400 


under 21 
21 — 29 
30 — 39 
40 — 49 
50 — 59 

60 & over 


ory: 


     (ii) Take the age distribution of Factory I as a standard and
          calculate the standardized accident rate per cent for
          Factory II.

A.15 Define gross and net reproduction rates. Explain how you would
     compute the net reproduction rate and what interpretations can
     be made if it is 1, less than 1 or greater than 1.

A.16 With the help of a hypothetical example, explain the difference
     between Gross and Net Reproduction Rates. Using some
     arbitrary figures, give the interpretation of these rates in
     population growth.                  (P.U., B.A./B.Sc. 1967)

A.17 Calculate age-specific fertility rates, total fertility rate, gross
     reproduction rate and net reproduction rate from the following
     data, assuming sex-ratio at birth to be 106.18 per cent.

Age-group    Female        Registered    Probability
(years)      Population    Births        of Survival
             (000)


27,639 
226,817 
280,506 
194,526 
113,966 

32,363 
2,215 


(P.U., B.A/B.Sc. 1996) 


A.18 (a) What is the Net Reproduction Rate, how is it calculated, and
         what purpose does it serve?
     (b) Calculate Net Reproduction Rate from the following data:

Age-group    Female        Female    Probability
(years)      Population    Births    of Survival
87 4 


11 


(P.U, B.A/B.Se. 1967) 


A.19 Compute the gross and net reproduction rates for the following
     data:

Female        Female live    Survival
Population    births         rate
(000)

              15,138
              94,155
              102,676
              72,490
              31,402
              10,640
              700

                                         (P.U., B.A./B.Sc. 1985)

A.20 (a) Describe methods of calculating gross and net reproduction
         rates. What are the relative merits of the net reproduction
         rate?
     (b) From the following data, calculate the gross and net
         reproduction rates, assuming sex ratio at birth to be 105.2
         per cent.

Age-group    Female        Survivors
(years)      Population    among females
                           out of 100

                                         (P.U., B.A./B.Sc. 1984)


APPENDIX B: Statistical Tables

Table 1. Significant Ranges for Duncan's Multiple Range Test

q(0.05; p, v)


18.0 
6.09 
4.50 
4.02 
3.83 


3.68 
3.61 
3.56 
3.52 
3.47 


v = degrees of freedom. 


18.0 
6.09 
4.50 
4.02 
3.83 


Table 2. Significant Ranges for Duncan's Multiple Range Test

q(0.01; p, v)


90.0 90.0 90.0 90.0 
14.0 14.0 14.0 14.0 
326 85 86 87 88 89 89 90 90 93 93 93 
T1 711 22 
5.96 6.11 6.18 6.26 6.33 640 644 65 68 68 68 


6.3 


6.3 
6.0 
5.8 


6.3 
6.0 
5.8 


5.95 6.00 6.00 
5.69 6.73 58 6.0 
5.47 551 556 5.8 
5.32 536 64 57 57 57 
5.28 5.55 5.55 5.55 


5.88 
5.61 
5.40 
5.25 
5.13 


5.20 5.24 


5.01 6.06 5.12 5.15 539 5.39 5.39 
5.26 5.26 5.26 


5.15 5.15 5.15 


439 4.63 4.77 4.86 4.94 
4.84 4.92 4.96 6.02 5.07 


4.88 4.94 4.98 


4.26 4.48 4.62 4.69 4.74 4.84 


ulag1 4.42 4.65 4.63 4.70 4.78 483 487 491 5.07 5.07 5.07 
16 goa’ âa aTi 48i 484. 500 5 5.00 
1413 4.34 4.45 454 460 467 472 4.76 4.79 4.94 4.94 4.94 
wl aio aao aal 450 486 463 468 473 4.15 4.89 489 os 
485 4. 
18 | 407 427 438 446 453 459 464 dig Die i 4.82 
‘19 | 4.05 4.24 4.35 443 4.50 4.56 4.61 4.64 4.67 4.82 ie en 
65 4.79 4 : 
ie 53 458 461 4 
402 4.22 4.33 440 447 4 
ii 
| 465 471 471 
gar 445 448 
|0 [389 406 416 422 49 eA T aan eo 
| 434 431 4 
‘| 1382 a99 410 417 424 4,30 jar 431 43i 453 p 
|e |376 3.92 4.03 412 4.11 423 ie gp 12048 4.64 465 
| 405 411 417 42l ute eee ee 
| 4.09 414 u 


Table 4. Areas for a Standard Normal Distribution 


Table 3. The Normal Distribution

 P      .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

.0       ∞    2.5758  2.3263  2.1701  2.0537  1.9600  1.8808  1.8119  1.7507  1.6954
.1    1.6449  1.5982  1.5548  1.5141  1.4758  1.4395  1.4051  1.3722  1.3408  1.3106
.2    1.2816  1.2536  1.2265  1.2004  1.1750  1.1503  1.1264  1.1031  1.0803  1.0581
.3    1.0364  1.0152   .9945   .9741   .9542   .9346   .9154   .8965   .8779   .8596
.4     .8416   .8239   .8064   .7892   .7722   .7554   .7388   .7225   .7063   .6903
.5     .6745   .6588   .6433   .6280   .6128   .5978   .5828   .5681   .5534   .5388
.6     .5244   .5101   .4958   .4817   .4677   .4538   .4399   .4261   .4125   .3989
.7     .3853   .3719   .3585   .3451   .3319   .3186   .3055   .2924   .2793   .2663
.8     .2533   .2404   .2275   .2147   .2019   .1891   .1764   .1637   .1510   .1383
.9     .1257   .1130   .1004   .0878   .0753   .0627   .0502   .0376   .0251   .0125

P     .002    .001    .000,1   .000,01   .000,001   .000,000,1
Z    3.0902  3.2905  3.8906   4.4172    4.8916     5.3267

The value of P for each entry is found by adding the column heading
to the value in the left hand margin. The corresponding value of Z is the
deviation such that the probability of an observation falling outside the
range from -Z to +Z is P. For example, P = .03 for Z = 2.1701; so that 3
per cent of normally distributed values will have positive or negative
deviations exceeding the standard deviation in the ratio 2.1701 at least.
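The relation between Z and P described above can be reproduced numerically. The following Python sketch is only an illustration (the function name is hypothetical); it uses the complementary error function, for which P(|Z| > z) = erfc(z/√2) when Z is a standard normal variable.

```python
import math

def two_tailed_p(z):
    """Probability that a standard normal value falls outside (-z, +z)."""
    return math.erfc(z / math.sqrt(2.0))

print(round(two_tailed_p(2.1701), 4))  # 0.03
print(round(two_tailed_p(1.9600), 4))  # 0.05
```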


Table 3 is taken from Table 1 of Fisher and Yates: Statistical
Tables for Biological, Agricultural and Medical Research, published by
Oliver and Boyd Ltd., Edinburgh, and printed by permission of the
authors and publishers.

 z     .00     .01    .02    .03    .04    .05    .06    .07    .08    .09

2.1   .4821
2.2   .4861   .4865  .4868  .4871  .4875  .4878  .4881  .4884
2.3   .4893   .4896  .4898  .4901                              .4913  .4916
2.4   .4918   .4920  .4922  .4925  .4927                       .4934  .4936
2.5   .4938   .4940  .4941  .4949                              .4951  .4952
2.6   .4953   .4955  .4956  .4957         .4960  .4961  .4962  .4963  .4964
2.7   .4965   .4966  .4967  .4968                .4971  .4972  .4973
2.8   .4974   .4975  .4976                                     .4980
2.9   .4981   .4982  .4983
3.0   .49865  .4987  .4987  .4988  .4988  .4989
3.1   .49903  .4991  .4991


Table 5. Fisher-z Values (zp) 


Table 6. Squares and Square Roots

n      n²      √n      √(10n)
J i x y e 7.48. 
m 1.44 1.095 3.464 5.1 | 3249 2387 1.550 
14| 196 1.183 3.742 SS | 3481 | 2429 | 7681 
S 225 1.225 3.873 6.C 36,00 2,449 7.746 
i 6 2.56 1.265 4.000 61 | 3721 | 2.470 | 7.810 
17 | 2.89 1.304 4.123 6.2 | 3844 | 2490 | 7,874 
ye | 324 1 342 4.243 6.2 | 39.69 : 2.510 | 7.937 
1.9 3.61 1.378 4.359 6.4 | 40.96 2.530 8.000 
2.0 4.00 1.414 4.472 65 | 4225 | 2.550 | 8,062 
21 441 1,449 4.583 6.6 | 43.56 2.569. 8.124 
22 4 84 1.483 4.690 6.1 | 44,89 2.558 8.185 
23 5.29 1.517 4.796 68 | 46.24 2.608 8.246 
24 | 5.76 1.549 | 48.99 69 | 47.61 | 2627 | 8.307 
1,581 5.000 7.0 | 49.00 | 2.646 | 8.367 
a6 E Kéi 5 099 71] 5041 | 2665 | 8.426 
27 7.29 1.613 5.196 72 | 51.84 | 2.683 8.485 
28 7.84 1673 5.292 13 53.29 2.702 8.544 
2.9 8.41 1.703 5.385 T4 | 54 f i 
30 | 900 į 1.732 | 5.477 ns | $625 | 279 | ais 
3.1 9.61 1.761 5.568 16 56E | aii | ens 
3.2 | 10.24 1.789 5.657 è . ae et 
33 | 10.89 1.817 5 745 78 | 60 84 | 2003 | Sase 
34 | 11.56 1.844 5 831 19 i i ne 
828 
35 | 1225 | 1871 | 5.916 80 | rel Zate | 9:000 
3.6 12.96 1.897 6.000 . 61.24 2.864 9.055 
37 | 1369 1.924 60n HE) 68°39 2.81 au 19 
8 949 . F 2 . 
39 15 2 11975 6.245 84 | 70.56 ai p 
g.s | 72.25 . 9,274 
40 | 16.00 2.000 HAF gé | 73.96 | 298 | 3:327 
4.1 16 8! 2.025 5 81 15.69 i 9.381 
42 | 17.64 | 2049 | 6.481 gg | 7744 | 2366 | S434 
43 | 1849 2.074 | 635 gs | 79.21 | 2 
44 | 19.36 2.098 6.633 3,000 | 9.487 
45 21 | 6.708 ac | far | 307 | 33 
46 | drag | 2.145 6.183 92 gat | 305) | 9.644 
47 | 22:09 | 2168 | & 93 | 8642 | 3065 | 9695 
4.8 23.04 2.191 as 9.4 | 88.36 
49 | 201 | 2.214 | 7° gs | 90.25 | 20m Het 
so | 2500 | 2.236 | 797 9.6 ay zils | 2808 
St | 26.01 | 26258 | 721 21 | 9604 | 3132 | 5'950 
5.2 27.04 2.280 7.280 5 98.01 3.1 
5.3 28.09 2.302 7348 i 
5.4 | 29.16 2.324 




14.16 


14.22 


14.26 


14.27 


14.28 


14.29 
14.30 
14.31 


14.32 


14.33 
14.35 
14.36 
14.37 
14.38 
14.39 
14.40 
14.41 
14.42 


ANSWERS TO EXERCISES

Chapter 14, Pp. 56-65

The possible distinct samples of size 2 are (2, 4), (2, 6), (2, 8),
(2, 10), (4, 6), (4, 8), (4, 10), (6, 8), (6, 10), (8, 10).
The sample means are 3, 4, 5, 6, 5, 6, 7, 7, 8, 9.
The sample variances are 1, 4, 9, 16, 1, 4, 9, 1, 4, 1.
μ = 6, σ² = 8, μ_x̄ = 6, E(S²) = 5.
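These figures can be reproduced by enumerating the samples directly. The Python sketch below is only illustrative; it assumes the population 2, 4, 6, 8, 10 implied by the listed samples, and uses the divisor n for the variances, which is what the listed values correspond to.

```python
from itertools import combinations
from statistics import mean, pvariance

population = [2, 4, 6, 8, 10]  # population implied by the listed samples

# All distinct samples of size 2 drawn without replacement.
samples = list(combinations(population, 2))
means = [mean(s) for s in samples]
variances = [pvariance(s) for s in samples]  # divisor n

print(means)
print(variances)
# Population mean and variance, mean of sample means, mean of sample variances.
print(mean(population), pvariance(population), mean(means), mean(variances))
```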


(a) n1 = 8, n2 = 9, n3 = 11, n4 = 12
(b) n1 = 900, n2 = 1800, n3 = 1500, n4 = 1200, n5 = 600.
(i)  x̄ :  4   4.5   5   5.5   6   6.5   7   7.5   8
     f :  1   2     3   4     5   4     3   2     1
(ii) μ_x̄ = 6, σ_x̄ = 1.
i) 1 
(px, 3 5 6 7 8 9 10 


fl) 1/15,2/15,4/15, 2/15,1/15,4/15, 1/15 
(a) z 2025 30 35 40 45 50 6.0 

fæ) 1/81,4/81,10/81,16/81,19/81,16/81,10/81,4/81, 1/81 
(iii) 5.14 + 3(1.81) 
20, Hz =9 


5.5 


x 3 4 5 6 7 8 9 10 11 12 
fŒ) 1/64, 3/64 ,6/64 10/64, 12/64, 12/64, 10/64, 6/64, 3/64,1/614 
u = 7.5, 0° = 11.25; aor o% = 3.75. 
RE. 
(a) p = = 5.2; (b) (i) o = 0.87, (ii) O; = 1.73; 
(c) 0.87; (d) 1.73. 
2_ 2 
OZ = 5/3, O med) = 7/2, z= 5 = H= timed) 


(a) Approximately normal; (b) 0.0668. 


0.5762. 

(a) Hz = 68.5, Oz = 0.54; (b) 154 

(i) 11, Gi) 30. 4 

(a) 0.80; (b) (i) [lz = 5.3, on = 0. 0225, (ii) 0.9082 
(b) 0.0764. 

(a) (i) 0.057 (ii) 0.013, (iii) 0.930 (b) 9; 

(b) (i) 0.0127, (ii) 0.8561 


550 


| 


answERS TO EXERCISES 


14.48 


14.49 


14.51 
14.52 


15.4 


15.5 
15.6 
15.9 


15.15 
15.16 


15.17 


15.20 
15.22 


15.24 


15.25 
15.26 


| 15.27 


14.50. 


551 
oD g | 
16/81, 8/81, 4/81 


TTET T7731 TT 3 


= 2.33, Ò` 


Hz x, 7 
X\-Xo 
0.0043, 14.45. 0.7070 
The sampling distribution of proportions i 
S: 
Pp 0/3 1% 2/3 ` 3/3 
9720 9/20: 1/20 


sie 
t 


fp) 1/20 
Hp = 0. 5, Var(p) = 0. 05 


The sampling distribution of proportions is: 
p. 0 1/3 2/3, 1 
fe) 1/35 12/35 18/35 4/35 

= 4/1, o; = 547 
ie G) 0.0877 (ii) 0.0065 (b) 0.0228 
0.9774 
(b) 0.6826. 


5/25 6/25 4/25 2/25 


Chapter 15, Pp. 113-122 
(a) X = 5.9, variance = 5.29, mg = 6.288 
(b) (i) 12.17, 0.21; (ii) 12. 4l, 0.14. 
(b) u = 15, 0? = 11.67; E(X) = 15, E(s*) = 11.67. 


8/25 


(b) 89%. 
(a) (i) X, and X» (ii) 140.6%, 71.1% (b) (i) Ty Tz and Ty 
(ii) Ty. 
i ies x z A _ X 
Op= 3 Gps 
(b) 33.43, 5.10. 

A A rx? 
(a) A=% (b) ge 
(b) 77.28 < p < 85-12 15.21 (b) 63.61 < E < a 
(b) 95.06 < p < 104-49 15.23 (b) ~227< H<] 
(a) 90% (b) 1010.9 < u < 10141 

| < 197.84 

nieret g< < 3.515 


(a) 30 < p < 40; D) 3.48 


166.42 < H < 177.58 


552 
15.28 


15.29 
15.31 
15.32 


15.33 


15.34 
15.35 
15.36 
15.37 
15.38 
15.39 
15.40 


16.5 
16.6 


16.7 
16.8 


16.10 
16.11 


16.12 


16.13 


16.14 


16.15 
. 16.16 


16.17 


16.18 


16.19 
16.20 
16.21 


16.22 
16.23 


INTRODUCTION TO STATISTICAL T 
5 . < 85.54. 
x = 85.08, s, = 7.2372, 84.62 < UL 


(b) -1.6 < u; ~ H2 $9. 

(b) 4.45 < Hy — Hy $15.55 
(a) 33.26 < p; < 34.60; 
(c) 5.59 < H~ H2 <$ 7.39 


(a) 1.55; (b) 0.26 < H~ He < 2.84. 


53.63 < H4 — He < 70.37 
(b) 0.053 < p < 0.107, (c) 0.1206 to 0.3060 


(a) 0.504 < p < 0.696 (b) 0.592 < p < 0.664, (c) 0.701 < p < 0.819 
(b) —0.088 < pı — Po < 0.188. 
—0.1414 < pı 7 P2 < —0.0986. 
(a) 0.556, (b) 1005. 
(a) 103; (b) 97; (c) 225 
Chapter 16, Pp. 161-168 
(b) α = 0.0548; β = 0.3504, 0.6177, 0.8281
(a) (i) 0.823, (ii) 0.583 (b) (i) 195; (ii) 0.0548; (iii) 0.9452.
(b) H0: μ ≤ 800 and H1: μ > 800; X̄ is the test-statistic.
β are: 0.1977, 0.4920, 0.7939, 0.9995
Powers are: 0.8023, 0.5080, 0.2061, 0.0005
(a) z1 = -11.96, z2 = -8.04 so β < 0.2;
n = 78.
(b) z = 4.67; reject H0: μ = 75.
(b) z = -2.88; reject H0 in favour of H1: μ < 45.
(a) z = 1.23; accept H0. (b) z = -1.93; reject H0.
(a) z = -0.45; accept H0. (b) z = -2; reject H0.
z = 8.97; reject H0.
(a) z = -4.69. The sample would not be so regarded.


(b) 26.84 < Hg < 28.04; 


(b)n = 19 


(b) z = -1.78. The sample can be so regarded.
(a) z = 5. The process is out of control.
(b) z = -2.5; reject the claim.
(a) z = 0.534; accept H0. (b) z = 4.22; reject H0.
z = -2.38; reject H0.
z = -2.48, significant at 5%; insignificant at 1%.
z = -2.00; accept H0.
(a) z = 9.92; reject H0. Australians are on the average taller than
Englishmen.
(b) z = 4.24; reject H0. The two brands differ in quality.

16.24 z = 2.16; accept H0.
16.25 (b) z = 1.94; reject H0. It could be concluded that his shooting
      has improved.
16.26 (a) z = 2.63; reject H0. The coin is biased.
      (b) z = 0.51; accept H0. The data are consistent with an equal
      sex division in the population.
16.27 z = 3.59; reject H0 in favour of H1.
16.28 (a) z = 0.487; accept H0. The given information is consistent
      with the hypothesis.
      (b) z = -1.80; reject H0.
16.29 (a) z = -4.73; reject H0. The claim is not legitimate.
      (b) z = 3.21; reject the claim.
16.30 z = -5.7; reject H0. The company's claim cannot be accepted.
16.31 (a) z = 1.27; accept H0. The difference is not significant.
      (b) z = -3.12; reject H0.
16.32 (a) z = 0.10; accept H0. The machine has not been improved.
      (b) z = 2.10; reject H0. The blue envelope will help sales.
16.34 z = -1.025; accept H0.
16.35 Accept H0: p = 0.5. The coin is fair.
Chapter 17, Pp. 222-238 
17.4 (b) (i) 55.47, 78.80, 129.64. (ii) 55.57, 79.10, 129.93. 


17.5 (b) 13.82 < σ² < 65.53.
17.6 (a) 0.135 < σ² < 0.953 (b) 0.0286 < σ² < 0.218
17.7 (a) 15.00 < σ² < 64.05 (b) 9.34 < σ² < 49.55
17.8 s² = 30.83; 24.20 < σ² < 40.83.


17.9 (b) %2 = 10; accept Ho 52 < 22.74 


17.10 (a) %2 = 15.75; accept Ho : O 


; Ho: 
(b) x2 = 33.33; accept #0 ssi 
17.11 (a) y2 = 24.108; reject Ho at ppan aea 
significance. 
(b) x2 = 8.11; accept Ho 4 = 16 accent Ho 
17.12 (a) y2=46.2; accept Ho (b) x = o nay be regar 
lation varianc variances: 
17.13 (b) u = 1.46. Popu f equal 


othesis 0 : 
17.14 (b) u = 0.47; accept the hyp? ake 34.73; reject Ho 
ept 11o: (b) X% . reject Ho 
17.18 (a) x2 = 5.00; aceP 0 p) y? = 10.14 


2 = 4.23; accept 70 


1 


2 = 20; 8.65 < 


at 1% level of 


that o = 0.9 years. 
ded homogeneous. 


17.19 (a) x 


= y? = 3.78; accept Ho. The data at apvietereny with theory. 
17.21 (a) χ² = 4.50; reject H0. The coin is not fair.
      (b) χ² = 4.07; reject H0. The dice are not fair.
17.22 (a) χ² = 0.87 or z = 0.61; accept H0 of equal sex division.
      (b) χ² = 2.60 or z = -1.94. Unable to say that treatment is
      effective.
17.23 (a) χ² = 8.47; reject H0. The data contradict the stated
      hypothesis.
      (b) χ² = 10.0; reject H0. The distribution of grades is not
      uniform.
17.24 (a) χ² = 23.28; reject H0. (b) χ² = 57.90; reject H0.
17.25 (a) χ² = 18.73; reject H0 that die is balanced.
      (b) Assuming that the total number of births is evenly spread
      over the period (i.e., on the basis of number of days in the
      various months), the expected number of births is 50,959 for
      a month of 31 days, 49,315 for a month of 30 days and 46,027
      for February.
      χ² = 322.62; reject H0. The data suggest the presence of
      seasonality.
17.26 (b) χ² = 19.63. The data are not consistent with the hypothesis.
17.27 (b) χ² = 2.79. The data conform to the binomial distribution.
17.28 χ² = 5.80. The dice were unbiased.
17.29 (a) χ² = 4.74; accept H0. The coins are well-balanced.
      (b) χ² = 1.12; accept H0. The fit is good.
17.30 χ² = 1.39. The fit is good.
17.31 χ² = 3.52. The data fit a Poisson distribution.
17.32 χ² = 21.04; reject H0. The data do not fit a Normal distribution.
17.33 χ² = 2.101; accept H0. The data follow a Normal distribution.
17.34 (b) (ABC)=12, (ABC)=16, (ABY) = 32, (ABY) = 2 (aPC) = 107, 
(By) = 66, (aBy) = 25 and (ABC) = 240. 
17.35 125.
17.36 (b) (βγ) = -57. The numbers reported in the various groups
      are not consistent.
17.37 Attributes are independent.
17.38 (i) Positive association. (ii) Negative association. (iii) A and B
      are independent.
os os : 
(b) %2 = 5.489; reject Hp. The results are associated. 


ANSWERS TO EXERCISES 


17.42 


17.43 
17.44 


47.45 
17.46 
17.47 
17.48 
17.49 


17.50 
17.52 
17.53 


17.54 
17.55 


17.56 
17.57 
17.58 


17.59 
17.60 


18.4 
18.5 
18.6 


18.7 


555 


( X 25; reject Hy. The two classifications are as i 
7 1 . 


(b) X? = 4.87; reject Hy. 
P , 

x2 = vT Attributes are highly associated 

(a) %* = 1.85; accept the hypothesis of independence 


Ža ‘ej 
(b) X° = 26.66, reject hypothesis of independence 
x? = 162; accept Ho. | 


x? = 26.6. There is association. 
x2 = 69.15; reject Ho 
x2 = 134.55; reject Ho. 


x2 = 51.28; reject Hy. The claim status is not i 
`  policyholder’s age. us is not independent of the 
Ta A 
x2 = 7.47; accept Ho that two variables are independent. 
%2 = 6.11, C = 0.12. Accept Hp. 
2 (wi Ranat ' 
x? (without Yates correction) = 4.62; ¥2 (with Yates’ correction) 
= 4,01; reject Hy. 
(b) X2 = 0.57; accept Ho. 
(b) Reject the null hypothesis. 
(c) p = 0.24; reject Ho 
(a) p = 0.1098; 
(b) p = 0.3114; not independent. 
(a) x? = 6.29; accept Hy that the proportion of defectives is 
about the same for all three shifts. 
(b). %2 = 110; reject Ho - 
(b) x? = 9.65; 
x2 = 27.95; reject Hp. 
x2 = 6.52; accept Hy that ther 
two groups. 
Chapter 18. PP- 265-272 
a) —5, Gi) 45, Gi g, (iv) = 26 
(b) 569 < H < 581. l , 
(a) 9.84 < H < 12.16 when O 15 known; 
is unknown. 
(b) (-2.27, 2.67 
known. 
(a) 87.0 <H <1 
(b) 12.3 < H < 167 


e is no significant difference in 


9.48 < H < 12.52 when O 


) when O js knowns (-1.68, 1.88) when g is not 
w 


07.4. 


A 
N TO STATI Eee 
INTRODUCTIO STICAL TH 
556 EORY vie ees 19, Pp. 290 . — 557 
52 < u < 35.28 
18.8 33.52 < H A 19.4 (b) 1.380 < 01/07 < 7.50. 
18.9 (a) 0.3 < H4- Hg < 15. 
(b) —4.6 < fl, — Hg < 16.0 (c) 0.68 < 51/0 < 6.59. 
í - 15.80 2 
18.10 8.80 < H4- H2 < 19.5 (b) 0.36 < 03/02 < 7.95; 0.60 < o Jo 
18.11 0.50 < ply — Hg < 1.74 i D 1/92 < 2.82. 
18.13 (a) t= 1.0; accept Ho (b) £= 1.5; accept Ho. (c) 0.152 < 03/0} < 1.84, 
(c) t = 0.44; accept Ho (d) t= —1.25; accept Ho. 19.6 (b) (i) F = 3.125; reject Hy, 
(e) t = —0.74; accept Ho. Gi) F = 2.48; accept Hy (the role of two samples 
18.14 -t = —4.45; reject Hp. The given values are not consistent. interchanged). en ea 
18.15 (a) t = 1.84; accept Hy, (b) £ = —1.71; accept Ho. Gii) F = 0.47; reject Hg (two-tailed test), 
18.17 = 3.16; reject Hy at 0.05 level but not at 0.01 level. It is 19.7 . (a) F = 5.33; reject Ho (two-tailed test) 
advisable to check the machines. : F = 0.18; accept Hy (one-tailed test). 
18.18 t = —0.86; accept Ho. The claim is vindicated. (b) F = 1.57. Two methods of teaching are equally variable. 
18.19 (b) t = 0.42; accept Hp. l 19.8 7 F = 1.28; accept Hp. 
18.20 t = 1.007; accept Ho. 19.9 F = 4.287; do not reject Ho. 
18.21 (a) t = —0.099; accept Hy. The soldiers are on the average not 19.10 (a) F(6, 5) = 1.28; difference is not significant at a=0.05 ie 
taller than sailors. tailed). 
(b) t = 3.05; reject Hy. Electrification does exert some effect on | (b) t = 0.54; difference is not significant at a = 0.05. 
tillering. i 19.11 F = 5.88; variances are not equal. 
18.22 = —0.68; accept Ho. 19.12 F = 1.69; accept Hp. : 
18.23 (b) t = 1.386; accept Hp. (c) £ = 2.608; reject Ho. Chapter 20, Pp. 336-346 
18.24: (a) t = 1.27; accept Hp. 20.7. (a) The nel of variance table is 
(b) t = —0.39; accept Hy. No significant difference in the strength 
of the two tyeps of ropes. 
18.25 i = 1.32; accept Hp. 
18.26 t' = 2.82 with v=14; reject the hypothesis of equal means. 
18.27 (b) t = 1.17; accept Hy | 
mate ` 20.8 (a) The analysis k N vari 
18.28 (a) t = 3.44; reject Ho. | errr E E 
(b) t = 0. 86; accept Hy. The same conclusion has not been [Source of Variation _ 1166. 95 
reached. as ‘ 
, Within machines Holod 
__18.29 (a) t = 2.16; reject Ho. Food B is better than food A. — emacs t with respect to items 
(b) t = 4.32; reject Ho. Food B is better than food A. are [Total aN differen 
18.30 t = 1.71; accept H The machines 
18.31 t = 2.48; r 4 t îr c d h radial tires give | produced. j 
; ~ eject: Họ. Cars equipped with radial tir 20, ee 
better fuel economy. “0.9 (a) The ANOVA 
18.32 t = —1.95; accept Ho. 


Source of Var jation 
Between salesmen 
Within salesmen 


558 


20.10 


20.11 


20.12 


20.13 


20.14 


20.15 


INTRODUCTION TO STATISTICAL THEORY 


Reject H0. The differences between salesmen are significant.


(b) The ANOVA Table is: 


Ea ee 


Treatments ee 359.19 119.73 5.95 
12 


Residuals 258.25 21.52 
ma OS l 


The treatments differ significantly.
The ANOVA Table is: 
a ee a L r 
Between Tube types 190.08 95.04 2.86 
Error 21 697.75 33.23 


ie |a ee || 


Accept Ho. 
The analysis of variance table is: 


-Between Samples 4 1225.6 306.4 3.59 
Within Samples 3 
3 


2986.0 85.31 

[Total | 39 422.6 | ~- | 
Reject Ho. 

The ANOVA table is: 


ba 


5 
9 
5 
7 


Between Groups 2 91.40 45.70 2.10 
Within Groups 1 827.04 21.80 
cr aa T A l 


Accept Ho. 
The analysis of variance table is 


Between Methods 2 22.38 11.19 
Within Methods 10 16.99 1.70 
ol] | ao | 


Reject H,. The methods differ significantly. 


The ANOVA table is: 
k 62.28 


Between Teachers 2 124.55 
29 8720.33 | 300.70 
Race a a 
Accept Ho. There is no significant difference. 


(a) u = 2.18. The variances are homogeneous. 
(b) The analysis of variance table is 


ae 


0.21 


S.V.               d.f.
Between samples
Within samples     99

Accept H0: all means are equal.

20.16 (b) The ANOVA table is:

S.V.               d.f.
Between samples
Within samples     16

Reject H0.

20.17 (a) The ANOVA table is:

S.V.               d.f.    SS         MS        F
Types of plants    2       14.9452    7.4726    13.70
Error              51      27.8135    0.5454
Total              53      42.7587

The rubber content of the three types of plants is different.

(b) (i) t = 0.26; accept H0. (ii) |t| = 5.05 with 51 d.f.

20.18 The analysis of variance table is:

S.V.               d.f.    SS       MS
Within schools     272     25977    95.5

Reject H0. The difference between the means of students in the different types of school is significant.

20.19 The ANOVA table is:

S.V.               d.f.    MS    F
                   3
Between Groups
Between Rations
Error

There is a significant difference between rations at the 5% level.

20.20 The analysis of variance table is:

S.V.
Factor A
Factor B
Error

Accept both the hypotheses.


20.21 The ANOVA table is: 


d.f. 


[af 
3 

Error 6 

ae pea e 


Between Breeds
Between Rations

There is significant difference both in breeds and between rations.
20.22 The analysis of variance table is:

S.V.                   d.f.    SS       MS       F
Between fertilizers    4       46.20    11.55    2.03
Between varieties      3       28.55    9.52     1.68
Error                  12      68.20    5.68

These data could have arisen from a population in which there was no difference between the yields of varieties and the fertilizers did not differ in their effect.

20.23 The analysis of variance table is:

S.V.                  d.f.    SS       MS
Between subjects      4       45.28    11.32
Between treatments    3       23.24    7.15
Error                 12      27.94    2.33

There is no difference between the treatment means.

20.24 The analysis of variance table is:

S.V.                 d.f.    SS      MS
Between Salesmen     3       600     200
Between Districts    2       3200    1600
Total                        6200

The salesmen were equally capable and all districts were equally profitable to work.

20.25 The analysis of variance table is:

S.V.                SS
Between Subjects    595.60
Between Students    1329.73
Error               813.07


Accept both the hypotheses.

20.26 The analysis of variance table is:

S.V.
Columns
Rows
Interaction
Error

S.V.
Diets
Drugs
Interaction
Error 12

Differences between drugs are significant but the differences due to the interaction of drugs and diets are not significant.

20.30 The analysis of variance table is:

S.V.         d.f.    SS        MS       F
Varieties    4       270.27    67.57    6.39
Error        10      105.68    10.57
Total        14      375.95

Reject H0. (F0.95(4,10) = 3.48)

20.31 The analysis of variance table is:

S.V.
Between Coater types
Between Days
Error

20.35 (i) The ANOVA-table is:

Difference betwe
M L
7. 421 413 481 5.27


(iii) F = 4; reject H0 at the 5% level.


(iv) One set of orthogonal contrasts is:
C1 = T.1 - T.2
C2 = T.1 + T.2 - 2T.3
C3 = T.1 + T.2 + T.3 - 3T.4
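
With equal replication, a set of contrasts is orthogonal when each coefficient row sums to zero and every pair of rows has a zero dot product. The short sketch below checks this for the set in part (iv), assuming it reads C1 = T.1 - T.2, C2 = T.1 + T.2 - 2T.3, C3 = T.1 + T.2 + T.3 - 3T.4. The name `is_orthogonal_set` is illustrative and not from the book.

```python
from itertools import combinations

# Coefficients on the treatment totals T.1, T.2, T.3, T.4 for C1, C2, C3.
# This reading of part (iv) and equal replication are both assumptions.
contrasts = {
    "C1": (1, -1, 0, 0),
    "C2": (1, 1, -2, 0),
    "C3": (1, 1, 1, -3),
}

def is_orthogonal_set(rows):
    """True if every row sums to zero and all pairs have zero dot product."""
    if any(sum(r) != 0 for r in rows):
        return False
    return all(sum(a * b for a, b in zip(r, s)) == 0
               for r, s in combinations(rows, 2))

orthogonal = is_orthogonal_set(list(contrasts.values()))
print(orthogonal)
```

A set such as (1, -1, 0, 0) and (1, 0, -1, 0) fails the same check, because the dot product of the two rows is 1.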
Chapter 21, Pp. 388-396 


21.3 1.03 < β < 2.09.

21.4 (i) b = 4.4096.
(ii) -263.08 < α < 10.11 or 159.87 < A < 168.93 if μY.X = A + β(X - X̄); 2.56 < β < 6.26

21.5 (a) Ŷ = 99.13 + 0.502X, or Ŷ = 121.3 + 0.502 (X - 44.21)

(b) (i) 86.7 < α < 111.55 or 118.16 < A < 124.44 if μY.X = A + β(X - X̄); 0.2258 < β < 0.7782
(ii) Prediction interval: 97.89 < Y < 150.57
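
The two forms of the fitted line in 21.5(a) describe the same line: expanding A + b(X - X̄) gives (A - bX̄) + bX. The sketch below recovers the uncentred intercept from the printed, rounded values, so it agrees with the printed 99.13 only to within rounding error. Variable and function names are illustrative.

```python
# Rounded values quoted in answer 21.5(a).
b = 0.502        # slope
A = 121.3        # intercept of the centred form (fitted value at X = x_bar)
x_bar = 44.21    # mean of X

# Intercept of the uncentred form: a = A - b * x_bar.
a = A - b * x_bar

def predict_centred(x):
    """Fitted value from Y = A + b (X - x_bar)."""
    return A + b * (x - x_bar)

def predict_uncentred(x):
    """Fitted value from Y = a + b X."""
    return a + b * x

print(round(a, 2))
print(predict_centred(50.0), predict_uncentred(50.0))
```

The centred form is convenient for interval estimation because A is estimated by the mean of Y, which is why its interval (118.16 to 124.44) is much narrower than the interval for α.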

21.6 (i) 0.92 < β < 1.09; 29.34 < μY.X=30 < 30.46

(ii) t = 0.119; accept H0
21.7 Ŷ = -179.41 + 5.03X; t = 6.89, reject H0.
21.8 Ŷ = -1 + 2X, t = 4.35, reject H0.
21.9 (i) t = 9.23; reject H0: β = 0.

(ii) t = -3.188; reject H0: β = 6.

(iii) 156.49; 139.26 < Y0 < 173.72 (prediction interval)
21.10 Ŷ = 30.056 + 0.897X; (i) t = 5.37, reject H0.

(ii) t = -0.167, accept H0: α = 32.
21.11 (b) (i) b1 = 1, b2 = 1,

(ii) -4.14 < β1 - β2 < 4.14; t = 0 so accept H0: β1 = β2

21.12 (i) b1 = 1, b2 = 1, b3 = 0.5454.

Gi) For Ho: Bi = Ba, t = 0.43, accept Ho. For Ho: By = Bs 

¢ t= 0.83; accept Ho. For Hy: By = Ba, t = 0.43; accept Ho. 

21.13 F = 14.60; regression is non-linear, 
21.14 F = 0.60; regression is linear. 
21.15 F = 1.86; regression is linear. 


21.16 (b) 0.100 < ρ < 0.297
21.17 (a) 0.80 < ρ < 0.78
21.18 (a) 0.47 < ρ < 0.87 (b) 0.771 < ρ < 0.973
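
Intervals for ρ such as those above are usually obtained with Fisher's z-transformation, which the index lists under confidence intervals for correlation coefficients. The following is a minimal sketch of that calculation. The inputs r = 0.60 and n = 28 are hypothetical and are not taken from any exercise, and `fisher_z_interval` is an illustrative name.

```python
import math

def fisher_z_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for rho from a sample correlation r.

    z = atanh(r) is treated as roughly normal with standard error
    1 / sqrt(n - 3); z_crit = 1.96 corresponds to 95% confidence.
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical inputs, chosen only to illustrate the calculation.
lower, upper = fisher_z_interval(r=0.60, n=28)
print(round(lower, 3), round(upper, 3))
```

Because the interval is built on the z scale and transformed back, it is not symmetric about r and it always stays inside (-1, 1).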

21.19 (a) z = -1.19; accept H0. (b) z = 1.62; accept H0

21.20 (a) z = -1.09; accept H0; values are consistent with hypothesis

(b) 0.503 < ρ < 0.640

(b) z = -1.49; accept H0.


21.33 


21.34 


21.35 


[i 


(b) t = 2.57; reject H0 in favour of H1.
(a) t = 1.08; accept H0. (b) t = 2.77; value is significant.
(c) z = -1.19; accept H0.

(a) t = 2.59; reject H0. (b) t = 1.58; accept H0.

(a) t = 1.07; accept H0. (b) t = 3.53; reject H0.
(c) F = 27.97; reject H0: ρ = 0.

(a) (i) 0.32; (ii) 0.41 (b) 0.38; (c) n = 37.
(b) 0.397; u = 0.44, correlations are homogeneous.
0.508; u = 0.81, accept H0.

(a) t = 2.31; reject H0. (b) t = 2.51

(a) t = 2.37; reject H0. F = 5.62; reject H0.

(b) u = 1.64; accept H0. Correlations are homogeneous.
(a) r12 = -0.89, r13.2 = -0.905, R1.23 = 0.98

(b) Correlations are significant.

(a) F = 1.19; accept H0. (b) F = 1.83; accept H0.
(b) The analysis of variance table for regression is:

S.V.          SS        MS      F
Regression    1503.5            338.6
Residual                4.44

b = 0.98, sb = 0.053, t = -0.38; accept H0: β = 1 at α = 0.01

(b) The analysis of variance table for regression is:

S.V.          d.f.    SS         MS       F
Regression    1       2083.24             85.34
Residual      3       73.26      24.42

Reject H0: The variables are related.

(i) Ŷ = 90 - 8X
(ii) Regression SS = Σ(Ŷ - Ȳ)² = 6400; Residual SS = 3600.
(iii) t = -8; reject H0 (iv) Adjusted y = 5

The analysis of variance for regression is:

S.V.
Regression
Residual


21.38 β̂0 = 8.88; β̂1 = 2.09; β̂2 = 2.65

ANOVA for overall significance test

S.V.          d.f.    SS       MS        F
Regression    2       70.69    35.345    17.20
Error         2                2.055

Accept H0: β1 = β2 = 0 at α = 0.05.
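
The overall F in this table is the regression mean square divided by the error mean square. The sketch below recomputes it from the printed figures and compares it with the tabulated 5% point F(0.95; 2, 2) = 19.00. Both the error degrees of freedom (read here as 2) and the choice of that table entry are assumptions about the garbled table, not statements from the book.

```python
# Figures as read from the ANOVA table for the overall significance test.
reg_ss, reg_df = 70.69, 2
error_ms, error_df = 2.055, 2   # error d.f. is an assumed reading of the table

reg_ms = reg_ss / reg_df
f_ratio = reg_ms / error_ms

# Tabulated upper 5% point of F with (2, 2) d.f.; assumes error_df = 2.
F_CRIT = 19.00
decision = "reject H0" if f_ratio > F_CRIT else "accept H0"

print(round(reg_ms, 3), round(f_ratio, 2), decision)
```

With so few error degrees of freedom the critical value is large, which is how an F ratio above 17 can still fail to reach significance at the 5% level.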
Chapter 22, Pp. 414-418 


22.8 The analysis of covariance table is: 


d.f.. Sum of Squares and Adjusted Results 
=] 
[ast > a e 


6. Ea 15. z 9.625 aga 
6 | 4.750 |12.750| 4.000 : 5 


10.875 | 27.875 | 13.625 10.805 
ea 


eal adjusted 1.423 1 | s,2= | 0.76 
1.424 
b = 0.842. To test H0: β = 0, F = 1.79; reject H0.
Regres- 
sion 


22.9 The analysis is:

S.V. ta =]
= 7009. eal 2826.69 | -1376.12
Within means 4674.00 | 2675.75 | 1862.75 | 0. 3985
Total [13,688.75 | [0.0416 |
ae eet Adjusted SS
sy.
Total 1 20. - 14° | 5482.17
Error 1 a 37 aero] 38
[9548.70 |

Using unadjusted y, F = 4.23; reject the null hypothesis H0: μA = μB = μC = μD.

By analysis of covariance, F = 6.73; reject H0: μA = μB = μC = μD after adjustment for quantity of food.


22.10 The analysis of covariance table is:

S.V.          d.f.    Sum of Squares and Products
Treatments    3       2.18     5.50      -0.49
Error         15      9.13     96.22     28.95
Total         18      11.31    101.72    28.46
Treatments adjusted 3.35-0.4

To test H0: β = 0; F = 290.38; reject H0. s = 0.11
22.11 The Analysis of Covariance table is: 




Sum of Squares and Products 


Adjusted Results 


y eT Hs 


Between Groups E p 73 | 31.6 | 25.6 -- 
ithin Groups am 20 EA 8 | 34.6 | 36.30 | 11] 3.30 


Reject H0 at α = 0.05.
22.12 The Analysis of Covariance table is:


d.f. Sum of Squares and Products 
"Era e 


4 | 112 | -532 al E 
26 


To test H0: β = 0, F = 8.52; reject H0.
To test H0: there is no difference between means, accept H0.


ee 


pais 50 
24.00 
99.24 


SS 


To test the hypothesis of no difference in adjusted Y-values, F = 2.33.
22.15 (a) b = 0.0238; F = 67.36; reject H0: β = 0.
(b) The analysis of covariance table then becomes

S.V.                   Σx²         Σy²       Σxy        Adjusted Σy²    d.f.    MS
T+E                    144685.4    136.09    4280.25    9.47            35
E                      28665.1     23.23     682.20     6.99            29      0.241
Treatments adjusted                                     2.48            6       0.413

F = 1.71; the differences among the treatment means for Y after adjusting for variation attributed to X are not significant.
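
The adjusted figures in 22.15(b) follow from the usual covariance adjustment: each sum of squares of Y is reduced by (sum of products)² divided by the sum of squares of X. The sketch below redoes that arithmetic from the printed table. The column order (X squares, Y squares, products) is an assumption inferred from the fact that it reproduces b = 0.0238, and all names are illustrative.

```python
# Sums of squares and products as read from the table in 22.15(b).
# Assumed column order: sum of squares of X, of Y, then sum of products.
rows = {
    "T+E": {"sxx": 144685.4, "syy": 136.09, "sxy": 4280.25, "df_adj": 35},
    "E":   {"sxx": 28665.1,  "syy": 23.23,  "sxy": 682.20,  "df_adj": 29},
}

def adjusted_ss(row):
    """Sum of squares of Y after removing linear regression on X."""
    return row["syy"] - row["sxy"] ** 2 / row["sxx"]

b = rows["E"]["sxy"] / rows["E"]["sxx"]       # within-treatment (error) slope
adj_total = adjusted_ss(rows["T+E"])
adj_error = adjusted_ss(rows["E"])
adj_treat = adj_total - adj_error             # adjusted treatment SS
treat_df = rows["T+E"]["df_adj"] - rows["E"]["df_adj"]

error_ms = adj_error / rows["E"]["df_adj"]
f_treat = (adj_treat / treat_df) / error_ms

print(round(b, 4), round(adj_total, 2), round(adj_error, 2), round(f_treat, 2))
```

Working from unrounded intermediate values gives an adjusted treatment sum of squares slightly below the printed 2.48, which was evidently formed from the rounded 9.47 and 6.99. The F ratio still rounds to the printed 1.71.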


Chapter 23, Pp. 464-476 


23.4 (b) The ANOVA-Table is:
S.V.              d.f.    SS      MS
Between Levels    2       7268    3634
Within Levels             493     s² = 41
S.E. = 2.87
23.5 The analysis is:
Between Varieties
Within Varieties    28    174.0    6.21
The data do not indicate significant difference.
23.6 Hypothesis H0: There is no difference in the effects of storage conditions, i.e., H0: μ1 = μ2 = μ3 = μ4 = μ5
The ANOVA-Table is:
Accept H0.
23.7 The analysis of variance table is: 


Between Formats 


Within Formats 


Significance level = 0.001. 


23.9 (b) The ANOVA-Table is:


Between Varieties’ 
Between Replication 
Error 


Accept the hypotheses.

23.10 (b) The ANOVA-Table is:

Treatments 3 168 56 3.382
Blocks 12 24
Error 9 152 16.89

Accept the hypotheses.

(b) If no blocking had been done, then the ANOVA-Table would become

S.V.                  MS
Between Treatments    56
Within Treatments     18.67

Accept H0.
23.11 (a) The ANOVA-Table is:

S.V.          SS          MS
Treatments                24942.1
Blocks        149700.4    29940.1
Error                     1872.5

F = 13.3; reject H0.
(b) The ANOVA-Table is:

24942.1
169425.0 | 11228.0

F = 2.22

23.12 (b) F = 7.673; The treatment means are significantly different.
23.13 (a) F (blocks) = 8.30; significant at α = 0.05.
F (fertilizers) = 6.11; significant at α = 0.05.


(b) (i) F = 17.37; significant at α = 0.05.
(ii) F = 0.96; not significant.
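
The two F values reported for 23.11 show what blocking buys: the treatment mean square is the same in both analyses, but the denominator changes once block-to-block variation is removed from the error. The sketch below recomputes both ratios from the printed mean squares. The unblocked error mean square is read here as 11228.0, which is an assumption about a garbled figure.

```python
# Mean squares as read from answer 23.11.
treatment_ms = 24942.1
error_ms_blocked = 1872.5      # error MS after removing block differences
error_ms_unblocked = 11228.0   # within-treatments MS when blocks are ignored (assumed reading)

f_blocked = treatment_ms / error_ms_blocked
f_unblocked = treatment_ms / error_ms_unblocked

print(round(f_blocked, 1), round(f_unblocked, 2))
```

The ratio of the two error mean squares, about 6, gives a rough idea of how much more sensitive the blocked analysis is for these data.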


23.14 The analysis of variance is: 
rae | 
Treatments 187035 
Blocks 220563 
Error 17716 
23.15 (a) 


Treatments 


Error 


(b) S.E. for one treatment mean = 0.048; and S.E. for difference between two treatment means = 0.068
(c) LSD = 0.140. The treatment means are arranged in ascending order and a line is drawn under the set of means that are not significantly different.

Yu T2 Ts Ti Tı 

1.464 1.662 


j: 
1.195 1.325 


(d) 122% 
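
The three figures in parts (b) and (c) of 23.15 are linked. With equal replication, the standard error of a difference between two treatment means is √2 times the standard error of one mean, and the LSD is a tabulated t value times that standard error. The answer does not state the t value, so the sketch below only backs out the value implied by the printed LSD.

```python
import math

# Printed values from answer 23.15 (b) and (c).
se_mean = 0.048
lsd = 0.140

# Equal replication is assumed, so SE(difference) = sqrt(2) * SE(mean).
se_diff = math.sqrt(2) * se_mean

# LSD = t * SE(difference); the t value is implied here, not quoted from the book.
implied_t = lsd / se_diff

print(round(se_diff, 3), round(implied_t, 2))
```

Any two treatment means that differ by more than the LSD are declared significantly different, which is the basis for the underlining described in part (c).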
23.17 (b) x = 6. The analysis is: 


Between Tests 4 
2 
7 


Between Ages 
The missing values are y4, = 15.8, yoo = 14.2, Y24 
The analysis would be 


Error 


23.18 


19.45 
2.55 
6.64 


Blocks 


Treatments 


Error 


= 13.9. 




23.19 
23.22 


23.23 


23.24 


23.25 


23.26 


23.27 


(b) m = 6.5. Treatment MS = 6.66 and Error MS = 0.51.
(b) The ANOVA-Table is


Rows 


Columns 
Treatments 
Error 


Varieties are significantly different at the 5 percent level.
The variation among districts is significant.
F = 4./2; no significant variation
F (tests) = 3.09; no difference between sensitivity of the tests.
The analysis of variance table is:


Positions 
Runs 
Treatments 
Error 


The grades of leather are significantly different. 

F = 20.36. LSD = 15.57. The treatment No. 1 is to be 
adjusted by 27 to get a uniform product. . 
The analysis of variance table is: 


S.V.          d.f.    SS          MS
Columns       4       183.7584    45.9396
Rows          4       141.0784    35.2696
Treatments    4       348.2384    87.0596
Error         12      304.0952    25.3413

Reject H0. Treatment means differ significantly.
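
Each mean square in a Latin square table is its sum of squares divided by its degrees of freedom, and the treatment F is the treatment mean square over the error mean square. The sketch below checks the table above on that basis. It assumes a 5 × 5 square (4 d.f. for rows, columns and treatments, 12 for error), and it assumes the Rows mean square and the Error sum of squares read 35.2696 and 304.0952, the values consistent with the other printed entries.

```python
# (sum of squares, degrees of freedom) for each source of variation.
# The d.f. assume a 5 x 5 Latin square; the Error SS is an assumed reading.
table = {
    "Columns":    (183.7584, 4),
    "Rows":       (141.0784, 4),
    "Treatments": (348.2384, 4),
    "Error":      (304.0952, 12),
}

# Mean square = sum of squares / degrees of freedom.
mean_squares = {name: ss / df for name, (ss, df) in table.items()}

# F ratio for treatments, tested against the error mean square.
f_treatments = mean_squares["Treatments"] / mean_squares["Error"]

for name, ms in mean_squares.items():
    print(name, round(ms, 4))
print("F (treatments) =", round(f_treatments, 2))
```

The same two lines of arithmetic can be used to cross-check any of the ANOVA tables in these answers where both the SS and d.f. columns are legible.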


(a) The analysis of variance table is: 


Columns 
Rows 
Treatments 


Error 
The variation due to treatments
(b) The analysis of variance table is:


Treatments 
Error 




23.41 


Reject H0.
23.28 F (periods) = 9.19; significant difference in periods.
F (conditions) = 6.62; conditions are significantly different.
F (monkeys) = 17.09; significant difference in monkeys.
F (comparison) = 2.23; data fail to support the conjecture.
23.37 The analysis is: 
SS MS 
594.05 
2.45 
23.38. The ANOVA-Table is: 
Columns 
Rows 
Treatments 
342.25 
182.25 
23.39 Bait ae eae 
ble ee (being regarded as trivial) mean squares 
used to provide an estimate of error mean s nd 
; quare. 
23.40 The ANOVA-Table is: 
Replication 
Treatments 
SSA = 2.6667, SSB = 170.6667, SSC = 104.1667, SS(AB) = 1.5, SS(AC) = 42.6667, SS(BC) = 0, SS(ABC) = 1.5


A 
23.42 


24.3 
24.4 
24.5 


24.6 
24.9 


24.10 
24.11 
24.12 
24.15 
24.16 
24.17 
24.18 


24.19 
24.20 
24.21 
24.22 
24.23 
24.24 


A.7 
A.8 


Ad 
A.11 
A.12 
A.13 
A.14 
A.17 


A.18 
A.19 
A.20 




(b) The covariance table is: an 


Regression 


364.19 390.37 F 
86.01 5 62.07 
3 

Chapter 24, Pp. 510-515 
(b) (i) z = -1.02; accept H0. (ii) z = -1.17; accept H0.
(b) (i) z = -0.95; accept H0. (ii) T = 13; accept H0.
(i) P (X<2) = 0.1445; accept H0. (ii) T = 6; accept H0.
(b) χ² = 1.6; accept H0.
R = 26; accept H0. The difference between the two types of paint is not significant.

z = -1.84; reject H0.
(a) U = 7; accept H0. (b) z = -0.56; accept H0.
z = -1.84; reject H0.
(b) χ² = 2.33; accept H0.
No significant difference.
χ² = 3.877, medians are equal.
(a) There exists nonrandomness.
(b) z = 1.78; accept the hypothesis of a random sample.
(a) n, = 8; sequence is random. (b) z = -0.51; accept H0.
(b) D = 0.24; accept H0.
D = 0.1485; accept H0.
(a) D = 0.36; accept H0. (b) χ² = 16.51; reject H0.
H = 11.32; reject H0.

ANSWERS TO APPENDIX A
19.5, 232.9, 33.0, 6.1, 3.1. 4.8, 6.0, 8.6, 11.5, nee . 
Males: 13.20, 18.90, 9.28, 14.16, 20.28, ae : 
Females: 16.09, 20.95, 11.12, 16.86, 23.20, 38.38, 45.40. 


(b) 11,34, 108, 288, 582, 1184. 
(b) (i) 18.85, 16.77 (ii) 16.78 


D.R. = 13.4, S.D.R. = 14.0. 
ie . = 13.32, S.D.R. = 9.02 


AS.E.R. = 19.41, 


T.F.R. = 2704 per 1,000 women
b) 2.385 N 

= G.R.R. = 1.071, N.R.R. = ag 
(b) G RR. = 4,512, N.R.R. 2.567. 


INDEX


A


α, 126
Acceptance Region, 125
Allocation of sample sizes, 17
equal, 17
Neyman, 18
proportional, 18
optimum, 18
Alternative Hypothesis, 123
Analysis of Covariance, 397
assumptions in, 413
introduction, 397
for estimation of missing observation, 438
LSD-test, 401, 410
models of, 411
one-way analysis and partitioning the sum of products, 398
two-way analysis, 405
uses of, 414
Analysis of Variance, 295
assumptions in, 310
for linear regression, 380
for multiple regression, 384
introduction, 295
LSD-test, 324
LS estimate of effects, 332, 334
models, 331
partitioning SS, 298
partitioning df, 300
one-way analysis, 296, 305
two-way ANOVA, 311, 312
two-way ANOVA with interaction, 318
several observations per cell, 318
table of, 301, 315


Area sampling, 22 
Association, of attributes, 205 
co-efficient of, 205 
dissociation, 205 
measure of, 205 
Assumptions 
in analysis of covariance, 413 
in analysis of variance, 310 
in F-distribution, 228 
in t-distribution, 245 
Attributes, 200 
association of, 205 
consistence, 203 
independence, 204 
ultimate class-frequencies, 201 


B 


β, 126
Bartlett’s test, 185 
Basic Experimental Designs 
(See experimental designs) 
Behrens-Fisher test, 258 
Bias, 16 
Binomial Distribution, 50, 159
BLUE, 78 
Brandt-Snedecor formula, 220 


C 
Central Limit Theorem, 30, 41 
Chi-Square Distribution, 169 
derivation of, 169 
properties of, 171 
table of, 175 
tests based on, 181 
Chi-square tests in contingency tables, 207
of equality of variances, 185
of goodness of fit, 194, 195
of homogeneity, 219
of equality of several correlation co-efficients, 376
of equality of several proportions, 216
of independence, 200
in two-by-two tables, 211, 215
about variance, 181
of p’s in multinomial distribution, 190
Cluster, 22
Cluster Sampling, 22
Co-efficient, of association, 205
of colligation, 206
of confidence, 94
of contingency, 213
Completely Randomised Designs, 421
advantages and disadvantages, 424
experimental layout, 421
statistical model and analysis, 422
Composite Hypothesis, 124
Concomitant variable, 397
Confidence Intervals (limits)
definition of, 94
for α, 352
for correlation coefficients, 368
for difference of means, 103, 247
for mean, 95, 246
for mean value, 352
for proportion, 107
for difference in proportions, 108
for regression co-efficient, 350
for small samples, 246
for variance of a normal population, 178
one-sided, 119, 247
for several sample variances, 180
for variance ratio, 289
interpretation of, 101
Consistence, 203
Consistent estimator, 76
Contingency, tables, 207
co-efficient of, 213
exact test for, 215
Continuity, correction for,
Yates’ correction, 214
Contrasts, 330
single degree of freedom, 453
Correlation, significance of, 370
Covariance analysis (see analysis of covariance)
Cramer-Rao inequality, 78
Criteria for Estimators, 69
Critical region, 125


D


Degrees of Freedom, 169, 208
Designs, Experimental (see experimental design)
Dichotomy, 201
Difference between means
confidence interval for, 103, 247
sampling distribution of, 44
testing hypothesis, 143
Difference between Proportions
confidence interval for, 108
sampling distribution, 54
testing hypothesis, 152
Dissociation (see association)
Distribution, binomial, 50, 159
chi-square, 169
F-, 278
Fisher’s z-, 273
normal, tests based on, 137
t-, 239
Double Sampling, 23
Duncan’s Multiple Range Test, 328


E


Efficiency of
latin square design, 451
randomized block design, 440
Efficient Estimator, 77
Errors of
first kind, 125
mean square, 78
sampling and non-sampling, 5
second kind, 125
Error Sum of Squares, 299
Estimate, 67
Estimation, 67
interval, 68
linear, 68
method of, 85
point, 68
Estimation of Missing observations by covariance, 438
in complete block experiment, 434
in latin square experiment, 449
Estimator, 20, 67
criteria for, 69
Exact Test for Independence, 215
Expected Mean Square, 334
Experimental Designs, 419
basic principles, 420
completely randomized design, 421
factorial design, 454
Graeco-latin Square, 451
introduction, 419
latin square design, 441
local control, 421
randomization, 42
randomized complete block design, 426
replication, 420


F


F-distribution, 273
assumptions in, 278
derivation of, 274
introduction, 273
properties of, 276
tables of, 278, 279, 280, 281
test based on, 284
Factorial Experiments, 454
advantages and disadvantages, 463
design and analysis, 458
main effects and interaction, 455
2²-factorial experiment, 455
2³-factorial experiment, 457
Yates’ technique, 462
Finite correction factor, 28
Finite population, 1
correction for, 28
Fisher’s exact test, 215
Fisher-Neyman Factorization criterion, 80
Fisher’s z-distribution, 273
Fisher’s z-transformation, 368
Fixed-effects model, 331, 411


G


Goodness of Fit Test, 194, 195
Graeco-Latin Squares, 451


H


Homogeneity, Test for, 219
of k-variances, 185
of several estimated correlation coefficients, 376

Homoscedasticity, 311 
Hypothesis, alternative, 123 
composite, 124 
exact, 124 
formulation of, 135 
general procedure for testing, 
137 
inexact, 124 
null, 123 
simple, 124 


I


Independence, of attributes, 204
Interaction, 311


J


Jacobian, 241, 275
Judgement Sample, 23


K


Karl Pearson’s approximation, 187


L


Lagrange’s multiplier, 333
Large samples, 29
Latin Square Designs, 441
advantages and disadvantages, 446
construction and layout, 442
efficiency in, 451
estimation of missing data in, 449
graeco-latin squares, 451
LSD, 446
orthogonal latin squares, 451
standard squares, 442
statistical model and analysis, 443
Least Significant Difference test, 324
Least Squares Estimates of effects, 332
Level, of confidence, 94
of significance, 129
Likelihood function, 87
Linear additive model, 331
Linear Regression, test for, 357, 365
on one variable, 380
on several variables, 384
Local Control, 421


M 
Method of 
maximum likelihood, 85 
moment, 92 
least squares, 93 
Mean Square Error, 78 
Minimum Variance Estimator, 78 
Missing data, 434 
Models, linear additive, 331 
analysis of covariance, 411 
analysis of variance, 332 
Moment Generating function, 30 
Multiple Comparisons Tests, 324 
Multiple correlation, significance 
test of, 379 
Multiphase Sample, 23 
Multistage Sample, 22 


N


Neyman Allocation, 18
Non-parametric tests, 477
Kolmogorov-Smirnov tests, 502
Kruskal-Wallis H test, 506
Mann-Whitney U test, 491
median test, 497
runs test for randomness, 498
Sign test, 478
Wilcoxon rank-sum test, 487
Wilcoxon signed-rank test, 483
Non-probability Sampling, 5, 23
Non-response, 4
Non-sampling error, 5
Normal Distribution,
area under, 545
table of, 544
tests based on, 137
Null Hypothesis, 123


O


One-tailed Test, 130
One-way Classification, 296
Operating Characteristic Curve, 129
Optimum Allocation, 18
Orthogonal Latin Squares, 451


P


Paired observations, 259
Parameter, 1
Partial correlation, significance test of, 377
Partitioning of Sum of Squares and D.F.
in analysis of covariance, 398
in analysis of variance, 298
Point Estimation, 68
criteria for, 69
methods of, 85
Pooled Estimate, from two or more samples, 82
distribution, 1
Sampled or target, 2
Prediction Interval, 355
Probability Sampling, 5, 9
Proportional Allocation, 18
Power, curve, 129
of a test, 128
Purposive Sampling, 5, 23


Q


Quota Sampling, 5, 24


R


Randomized Complete Blocks Design, 426
advantages and disadvantages, 429
efficiency in, 440
estimation of missing data in, 434
experimental layout, 426
statistical model and analysis, 427
with replications within blocks, 433
Random Effects Model, 352
Random Number Table, 7, 8
Random Sampling, simple, 5, 9
Region of Rejection (see critical region)
Regression, linear, one variable, 380
several variables, 384
Relative Efficiency, 440
Replication, 420
Residual, 381


S


Sample, 1
Sample Design
area, 22
cluster, 22
multiphase, 23
multi-stage, 22
non-probability, 5, 23
probability, 5, 9
purposive, 5, 23
quota, 5, 24
sequential, 23
simple random, 5, 9
stratified random, 5, 17
systematic random, 5, 21
sub-sampling, 22
Sample size, 111, 113, 133
Sample Survey, 4
Sampling, advantages of, 3
bias, 6
error, 5
introduction, 1
units, 1
with and without replacement, 5
Sampling Distribution, 24
of a, 349
of b, 348
of the means, 25
of differences between means, 44
of a proportion, 49
of differences between proportions, 54
of variance, 55
of Ŷ, 349
Sampling Frame, 4
Scheffe’s method, 330
Several Sample Variances, 180
Significance, level, 129
tests of, 129
Simple Hypothesis, 124
Single Degree of Freedom Contrasts, 453
sory’s, method, 211
Snedecor’s F-distribution (see F-distribution)
Standard Error, 25
Statistic, 1
Statistical Models (see models)
Stratified Random Sampling, 5, 17
Student-Newman-Keuls Multiple Range Test, 327
Student’s t-distribution (see t-distribution)
Sufficient Estimator, 89
Systematic Sampling, 5, 21


T


t-distribution, 239
assumptions in, 245
definition of, 239
derivation of, 240
introduction, 239
for difference between two means, 245
paired test, 259
properties of, 242
tables of, 243, 244
tests based on, 249
Target Population, 1
Tests of significance, 129
Testing of Hypothesis, 67, 123
introduction, 123
about α, 361
about β, 357
general procedure of, 137
about correlation co-efficients, 370, 371, 374
about equality of regression co-efficients, 362, 372
about equality of several correlations, 376
about differences between means, 143, 249, 257, 296
about proportion, 149
about differences between two proportions, 152
about equality of two standard deviations, 156
about equality of two variances, 284
about equality of k variances, 185
about homogeneity, 219
about independence, 207
about linear regression, 357, 365
about means, 139, 141, 142, 249
about mean value μY.X, 362
about multiple correlation, 379
about partial correlations, 377
about regression co-efficient, 357
about variance, 181, 362
of standard deviation, 156
based on binomial dist., 159
based on normal distribution, 137
based on small samples, 249
Test-statistic, 124
Two-phase sampling, 23
Two-tailed test, 130
Two-way Classification, 296
Type I and Type II Errors, 125


U


Unbiased Estimator,
definition, 70
of population variance, 71


V


Variance, Analysis of (see analysis of variance)
Variance Ratio, 273


Y


Yule’s co-efficient of association, 205
Yates’ correction for continuity, 214
Yates’ technique for contrasts, 462


Z


z-distribution, 273
Z-tests, 137-158
z-transformation of r, 368




